cyclismo R tutorial




1. Input


Here we explore how to define a data set in an R session. Only two commands are explored. The first is for simple assignment of data, and the second is for reading in a data file. There are many ways to read data into an R session, but we focus on just two to keep it simple.

1.1. Assignment

The most straightforward way to store a list of numbers is through an assignment using the c command. (c stands for “combine.”) The idea is that a list of numbers is stored under a given name, and the name is used to refer to the data. A list is specified with the c command, and assignment is specified with the “<-” symbols. Another term used to describe a list of numbers is a “vector.”

The numbers within the c command are separated by commas. As an example, we can create a new variable, called “bubba” which will contain the numbers 3, 5, 7, and 9:

> bubba <- c(3,5,7,9)
>

When you enter this command you should not see any output except a new command line. The command creates a list of numbers called “bubba.” To see what numbers are included in bubba, type “bubba” and press the enter key:

> bubba
[1] 3 5 7 9

If you wish to work with one of the numbers you can get access to it using the variable and then square brackets indicating which number:

> bubba[2]
[1] 5
> bubba[1]
[1] 3
> bubba[0]
numeric(0)
> bubba[3]
[1] 7
> bubba[4]
[1] 9

Notice that the first entry is referred to as entry number 1, and the zero entry does not return a value from the list; instead it reports how the data is stored (here an empty numeric vector). You can store strings using both single and double quotes, and you can store real numbers.
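
For example, here is a small illustration (the variable names are arbitrary) showing that both kinds of quotes and real numbers are handled the same way:

> first <- "hello"
> second <- 'there'
> x <- c(1.5,2.75,3.25)
> first
[1] "hello"
> second
[1] "there"
> x
[1] 1.50 2.75 3.25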

You now have a list of numbers and are ready to explore. In the chapters that follow we will examine the basic operations in R that will allow you to do some of the analyses required in class.

1.2. Reading a CSV file

Unfortunately, it is rare to have just a few data points that you do not mind typing in at the prompt. It is much more common to have a lot of data points with complicated relationships. Here we will examine how to read a data set from a file using the read.csv function, but first we discuss the format of a data file.

We assume that the data file is in the format called “comma separated values” (csv). That is, each line contains a row of values which can be numbers or letters, and each value is separated by a comma. We also assume that the very first row contains a list of labels. The idea is that the labels in the top row are used to refer to the different columns of values.

First we read a very short, somewhat silly, data file. The data file is called simple.csv and has three columns of data and six rows. The three columns are labeled “trial,” “mass,” and “velocity.” We can pretend that each row comes from an observation during one of two trials labeled “A” and “B.” A copy of the data file is shown below and is created in defiance of Werner Heisenberg:

simple.csv
trial mass velocity
A 10 12
A 11 14
B 5 8
B 6 10
A 10.5 13
B 7 11

The command to read the data file is read.csv. We have to give the command at least one argument, but we will give three different arguments to indicate how the command can be used in different situations. The first argument is the name of the file. The second argument indicates whether or not the first row is a set of labels. The third argument indicates that the values on each line are separated by commas. The following command will read in the data and assign it to a variable called “heisenberg:”

> heisenberg <- read.csv(file="simple.csv",head=TRUE,sep=",")
> heisenberg
trial mass velocity
1     A 10.0       12
2     A 11.0       14
3     B  5.0        8
4     B  6.0       10
5     A 10.5       13
6     B  7.0       11
> summary(heisenberg)
trial      mass          velocity
A:3   Min.   : 5.00   Min.   : 8.00
B:3   1st Qu.: 6.25   1st Qu.:10.25
      Median : 8.50   Median :11.50
      Mean   : 8.25   Mean   :11.33
      3rd Qu.:10.38   3rd Qu.:12.75
      Max.   :11.00   Max.   :14.00

(Note that if you are using a Microsoft system the file naming convention is different from what we use here. If you want to use a backslash it needs to be escaped, i.e. use two backslashes together, “\\”. Also you can specify what folder to use by clicking on the “File” option in the main menu and choosing the option to specify your working directory.)
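
As a sketch (the folder name below is purely hypothetical), a path on a Microsoft system can be given with doubled backslashes or with forward slashes, and the working directory can also be changed from the command line using the setwd command:

> heisenberg <- read.csv(file="C:\\temp\\simple.csv",head=TRUE,sep=",")
> setwd("C:/temp")
> heisenberg <- read.csv(file="simple.csv",head=TRUE,sep=",")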

To get more information on the different options available you can use the help command:

> help(read.csv)

If R is not finding the file you are trying to read then it may be looking in the wrong folder/directory. If you are using the graphical interface you can change the working directory from the file menu. If you are not sure what files are in the current working directory you can use the dir() command to list the files and the getwd() command to determine the current working directory:

> dir()
[1] "fixedWidth.dat" "simple.csv"     "trees91.csv"    "trees91.wk1"
[5] "w1.dat"
> getwd()
[1] "/home/black/write/class/stat/stat383-13F/dat"

The variable “heisenberg” contains the three columns of data. Each column is assigned a name based on the header (the first line in the file). You can now access each individual column using a “$” to separate the two names:

> heisenberg$trial
[1] A A B B A B
Levels: A B
> heisenberg$mass
[1] 10.0 11.0  5.0  6.0 10.5  7.0
> heisenberg$velocity
[1] 12 14  8 10 13 11

If you are not sure what columns are contained in the variable you can use the names command:

> names(heisenberg)
[1] "trial"    "mass"     "velocity"

We will look at another example which is used throughout this tutorial. The data file was created by a group at Oak Ridge National Laboratory, and I converted it to a CSV file to make it easier to work with. The original data is given in an Excel spreadsheet, and the CSV file, trees91.csv, was created by deleting the top set of rows and saving it as a “csv” file. This is an option to save within Excel. (You should save the file on your computer.) It is a good idea to open this file in a spreadsheet and look at it. This will help you make sense of how R stores the data.

The original spreadsheet is located at http://cdiac.ornl.gov/ftp/ndp061a/trees91.wk1 . A description of the data file is located at http://cdiac.ornl.gov/ftp/ndp061a/ndp061a.txt .

The data is used to indicate an estimate of biomass of ponderosa pine in a study performed by Dale W. Johnson, J. Timothy Ball, and Roger F. Walker who are associated with the Biological Sciences Center, Desert Research Institute, P.O. Box 60220, Reno, NV 89506 and the Environmental and Resource Sciences College of Agriculture, University of Nevada, Reno, NV 89512. The data consists of 54 lines, and each line represents an observation. Each observation includes markers and measurements for 28 different attributes of a given tree. For example, the first number in each row is a number, either 1, 2, 3, or 4, which signifies a different level of exposure to carbon dioxide. The sixth number in every row is an estimate of the biomass of the stems of a tree. Note that the very first line in the file is a list of labels used for the different columns of data.

The data can be read into a variable called “tree” using the read.csv command:

> tree <- read.csv(file="trees91.csv",header=TRUE,sep=",");

This will create a new variable called “tree.” If you type in “tree” at the prompt and hit enter, all of the numbers stored in the variable will be printed out. Try this, and you should see that it is difficult to make any sense out of the numbers.

There are many different ways to keep track of data in R. When you use the read.csv command R uses a specific kind of variable called a “data frame.” All of the data are stored within the data frame as separate columns. If you are not sure what kind of variable you have then you can use the attributes command. This will list all of the things that R uses to describe the variable:

> attributes(tree)
$names
[1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC"
[9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC"
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC"

$class
[1] "data.frame"

$row.names
[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
[46] "46" "47" "48" "49" "50" "51" "52" "53" "54"

The first thing that R stores is a list of names which refer to each column of the data. For example, the first column is called “C”, the second column is called “N.” Tree is of type data.frame. Finally, the rows are numbered consecutively from 1 to 54. Each column has 54 numbers in it.

If you know that a variable is a data frame but are not sure what labels are used to refer to the different columns you can use the names command:

> names(tree)
[1]  "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC"
[9]  "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC"
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC"

If you want to work with the data in one of the columns you give the name of the data frame, a “$” sign, and the label assigned to the column. For example, the first column in tree can be called using “tree$C:”

> tree$C
[1]  1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4

1.3. Brief Note on Fixed Width Files

There are many ways to read data using R. We only give two examples, direct assignment and reading csv files. However, another way deserves a brief mention. It is common to come across data that is organized in flat files and delimited at preset locations on each line. This is often called a “fixed width file.”

The command to deal with these kinds of files is read.fwf. This command is not explored in depth here, but a brief example is given. If you would like more information on how to use this command enter the following command:

> help(read.fwf)

The read.fwf command requires at least two options. The first is the name of the file and the second is a list of numbers that gives the length of each column in the data file. A negative number in the list indicates that the column should be skipped. Here we give the command to read the data file fixedWidth.dat. In this data file there are three columns. The first column is 17 characters wide, the second column is 15 characters wide, and the last column is 7 characters wide. In the example below we use the optional col.names option to specify the names of the columns:

 > a = read.fwf('fixedWidth.dat',widths=c(-17,15,7),col.names=c('temp','offices'))
 > a
  temp offices
1 17.0      35
2 18.0     117
3 17.5      19
4 17.5      28

2. Basic Data Types


We look at some of the ways that R can store and organize data. This is a basic introduction to a small subset of the different data types recognized by R and is not comprehensive in any sense. The main goal is to demonstrate the different kinds of information R can handle. It is assumed that you know how to enter data or read data files which is covered in the first chapter.

2.1. Variable Types

2.1.1. Numbers

The way to work with real numbers has already been covered in the first chapter and is briefly discussed here. The most basic way to store a number is to make an assignment of a single number:

> a <- 3
>

The “<-” tells R to take the number to the right of the symbol and store it in a variable whose name is given on the left. You can also use the “=” symbol. When you make an assignment R does not print out any information. If you want to see what value a variable has just type the name of the variable on a line and press the enter key:

> a
[1] 3

This allows you to do all sorts of basic operations and save the numbers:

> b <- sqrt(a*a+3)
> b
[1] 3.464102

If you want to get a list of the variables that you have defined in a particular session you can list them all using the ls command:

> ls()
[1] "a" "b"

You are not limited to just saving a single number. You can create a list (also called a “vector”) using the c command:

> a <- c(1,2,3,4,5)
> a
[1] 1 2 3 4 5
> a+1
[1] 2 3 4 5 6
> mean(a)
[1] 3
> var(a)
[1] 2.5

You can get access to particular entries in the vector in the following manner:

> a <- c(1,2,3,4,5)
> a[1]
[1] 1
> a[2]
[1] 2
> a[0]
numeric(0)
> a[5]
[1] 5
> a[6]
[1] NA

Note that the zero entry does not return a value; it indicates how the data is stored. The first entry in the vector is the first number, and if you try to get a number past the last number you get “NA.”

Examples of the sort of operations you can do on vectors are given in a later chapter.

To initialize a list of numbers the numeric command can be used. For example, to create a list of 10 numbers, initialized to zero, use the following command:

> a <- numeric(10)
> a
[1] 0 0 0 0 0 0 0 0 0 0

If you wish to determine the data type used for a variable, use the typeof command:

> typeof(a)
[1] "double"

2.1.2. Strings

You are not limited to just storing numbers. You can also store strings. A string is specified by using quotes. Both single and double quotes will work:

> a <- "hello"
> a
[1] "hello"
> b <- c("hello","there")
> b
[1] "hello" "there"
> b[1]
[1] "hello"

The name of the type given to strings is character:

> typeof(a)
[1] "character"
> a = character(20)
> a
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

2.1.3. Factors

Another important way R can store data is as a factor. Often times an experiment includes trials for different levels of some explanatory variable. For example, when looking at the impact of carbon dioxide on the growth rate of a tree you might try to observe how different trees grow when exposed to different preset concentrations of carbon dioxide. The different levels are also called factors.

Assuming you know how to read in a file, we will look at the data file given in the first chapter. Several of the variables in the file are factors:

> summary(tree$CHBR)
A1  A2  A3  A4  A5  A6  A7  B1  B2  B3  B4  B5  B6  B7  C1  C2  C3  C4  C5  C6
3   1   1   3   1   3   1   1   3   3   3   3   3   3   1   3   1   3   1   1
C7 CL6 CL7  D1  D2  D3  D4  D5  D6  D7
1   1   1   1   1   3   1   1   1   1

Because the set of options given in the data file corresponding to the “CHBR” column are not all numbers, R automatically assumes that it is a factor. When you use summary on a factor it does not print out the five-point summary; rather, it prints out the possible values and the frequency with which they occur.

In this data set several of the columns are factors, but the researchers used numbers to indicate the different levels. For example, the first column, labeled “C,” is a factor. Each tree was grown in an environment with one of four different possible levels of carbon dioxide. The researchers quite sensibly labeled these four environments as 1, 2, 3, and 4. Unfortunately, R cannot determine that these are factors and must assume that they are regular numbers.

This is a common problem and there is a way to tell R to treat the “C” column as a set of factors. You specify that a variable is a factor using the factor command. In the following example we convert tree$C into a factor:

> tree$C
[1]  1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
> summary(tree$C)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   2.000   2.519   3.000   4.000
> tree$C <- factor(tree$C)
> tree$C
[1]  1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4
> summary(tree$C)
1  2  3  4
8 23 10 13
> levels(tree$C)
[1] "1" "2" "3" "4"

Once a vector is converted into a set of factors then R treats it differently. A set of factors has a discrete set of possible values, and it does not make sense to try to find averages or other numerical descriptions. One thing that is important is the number of times that each factor appears, called their “frequencies,” which is printed using the summary command.
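
As a small illustration using a made-up factor (not part of the tree data), the levels and their frequencies can be seen directly:

> f <- factor(c("low","high","low","medium"))
> levels(f)
[1] "high"   "low"    "medium"
> summary(f)
high low medium
   1   2      1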

2.1.4. Data Frames

Another way that information is stored is in data frames. This is a way to take many vectors of different types and store them in the same variable. The vectors can be of all different types. For example, a data frame may contain many lists, and each list might be a list of factors, strings, or numbers.

There are different ways to create and manipulate data frames. Most are beyond the scope of this introduction. They are only mentioned here to offer a more complete description. Please see the first chapter for more information on data frames.

One example of how to create a data frame is given below:

> a <- c(1,2,3,4)
> b <- c(2,4,6,8)
> levels <- factor(c("A","B","A","B"))
> bubba <- data.frame(first=a,
                      second=b,
                      f=levels)
> bubba
first second f
1     1      2 A
2     2      4 B
3     3      6 A
4     4      8 B
> summary(bubba)
    first          second    f
Min.   :1.00   Min.   :2.0   A:2
1st Qu.:1.75   1st Qu.:3.5   B:2
Median :2.50   Median :5.0
Mean   :2.50   Mean   :5.0
3rd Qu.:3.25   3rd Qu.:6.5
Max.   :4.00   Max.   :8.0
> bubba$first
[1] 1 2 3 4
> bubba$second
[1] 2 4 6 8
> bubba$f
[1] A B A B
Levels: A B

2.1.5. Logical

Another important data type is the logical type. There are two predefined variables, TRUE and FALSE:

> a = TRUE
> typeof(a)
[1] "logical"
> b = FALSE
> typeof(b)
[1] "logical"

The standard logical operators can be used:

< less than
> greater than
<= less than or equal
>= greater than or equal
== equal to
!= not equal to
| entry wise or
|| or
! not
& entry wise and
&& and
xor(a,b) exclusive or

Note that there is a difference between operators that act on entries within a vector and the whole vector:

> a = c(TRUE,FALSE)
> b = c(FALSE,FALSE)
> a|b
[1]  TRUE FALSE
> a||b
[1] TRUE
> xor(a,b)
[1]  TRUE FALSE

There are a large number of functions that test to determine the type of a variable. For example the is.numeric function can determine if a variable is numeric:

> a = c(1,2,3)
> is.numeric(a)
[1] TRUE
> is.factor(a)
[1] FALSE

2.2. Tables

Another common way to store information is in a table. Here we look at how to define both one way and two way tables. We only look at how to create and define tables; the functions used in the analysis of proportions are examined in another chapter.

2.2.1. One Way Tables

The first example is for a one way table. One way tables are not the most interesting example, but they are a good place to start. One way to create a table is using the table command. The argument it takes is a vector of factors, and it calculates the frequency with which each factor occurs. Here is an example of how to create a one way table:

> a <- factor(c("A","A","B","A","B","B","C","A","C"))
> results <- table(a)
> results
a
A B C
4 3 2
> attributes(results)
$dim
[1] 3

$dimnames
$dimnames$a
[1] "A" "B" "C"


$class
[1] "table"

> summary(results)
Number of cases in table: 9
Number of factors: 1

If you know the number of occurrences for each factor then it is possible to create the table directly, but the process is, unfortunately, a bit more convoluted. There is an easier way to define one-way tables (a table with one row), but it does not extend easily to two-way tables (tables with more than one row). You must first create a matrix of numbers. A matrix is like a vector in that it is a list of numbers, but it is different in that you can have both rows and columns of numbers. For example, in our example above the number of occurrences of “A” is 4, the number of occurrences of “B” is 3, and the number of occurrences of “C” is 2. We will create one row of numbers. The first column contains a 4, the second column contains a 3, and the third column contains a 2:

> occur <- matrix(c(4,3,2),ncol=3,byrow=TRUE)
> occur
     [,1] [,2] [,3]
[1,]    4    3    2

At this point the variable “occur” is a matrix with one row and three columns of numbers. To dress it up and use it as a table we would like to give it labels for each column just like in the previous example. Once that is done we convert the matrix to a table using the as.table command:

> colnames(occur) <- c("A","B","C")
> occur
     A B C
[1,] 4 3 2
> occur <- as.table(occur)
> occur
  A B C
A 4 3 2
> attributes(occur)
$dim
[1] 1 3

$dimnames
$dimnames[[1]]
[1] "A"

$dimnames[[2]]
[1] "A" "B" "C"


$class
[1] "table"

2.2.2. Two Way Tables

If you want to add rows to your table just add another vector to the argument of the table command. In the example below we have two questions. In the first question the responses are labeled “Never,” “Sometimes,” or “Always.” In the second question the responses are labeled “Yes,” “No,” or “Maybe.” The vectors “a” and “b” contain the responses for each measurement. The third item in “a” is how the third person responded to the first question, and the third item in “b” is how the third person responded to the second question.

> a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
> b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
> results <- table(a,b)
> results
           b
a           Maybe No Yes
  Always        2  0   0
  Never         0  1   1
  Sometimes     2  1   1

The table command allows us to do a very quick calculation, and we can immediately see that two people who said “Sometimes” to the first question also said “Maybe” to the second question.

Just as in the case with one-way tables it is possible to manually enter two way tables. The procedure is exactly the same as above except that we now have more than one row. We give a brief example below to demonstrate how to enter a two-way table that includes a breakdown of a group of people by both their gender and whether or not they smoke. You enter all of the data as one long list but tell R to break it up into some number of columns:

> sexsmoke<-matrix(c(70,120,65,140),ncol=2,byrow=TRUE)
> rownames(sexsmoke)<-c("male","female")
> colnames(sexsmoke)<-c("smoke","nosmoke")
> sexsmoke <- as.table(sexsmoke)
> sexsmoke
       smoke nosmoke
male      70     120
female    65     140

The matrix command creates a two by two matrix. The byrow=TRUE option indicates that the numbers are filled in across the rows first, and the ncol=2 indicates that there are two columns.
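
To see the effect of the byrow option, compare the same numbers filled in with byrow=FALSE, in which case the columns are filled in first (shown here only as an illustration, without the row and column names):

> matrix(c(70,120,65,140),ncol=2,byrow=FALSE)
     [,1] [,2]
[1,]   70   65
[2,]  120  140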

3. Basic Operations and Numerical Descriptions


We look at some of the basic operations that you can perform on lists of numbers. It is assumed that you know how to enter data or read data files which is covered in the first chapter, and you know about the basic data types.

3.1. Basic Operations

Once you have a vector (or a list of numbers) in memory most basic operations are available. Most of the basic operations will act on a whole vector and can be used to quickly perform a large number of calculations with a single command. There is one thing to note: if you perform an operation on more than one vector it is often necessary that the vectors all contain the same number of entries.

Here we first define a vector which we will call “a” and will look at how to add and subtract constant numbers from all of the numbers in the vector. First, the vector will contain the numbers 1, 2, 3, and 4. We then see how to add 5 to each of the numbers, subtract 10 from each of the numbers, multiply each number by 4, and divide each number by 5.

> a <- c(1,2,3,4)
> a
[1] 1 2 3 4
> a + 5
[1] 6 7 8 9
> a - 10
[1] -9 -8 -7 -6
> a*4
[1]  4  8 12 16
> a/5
[1] 0.2 0.4 0.6 0.8

We can save the results in another vector called b:

> b <- a - 10
> b
[1] -9 -8 -7 -6

If you want to take the square root, find e raised to each number, the logarithm, etc., then the usual commands can be used:

> sqrt(a)
[1] 1.000000 1.414214 1.732051 2.000000
> exp(a)
[1]  2.718282  7.389056 20.085537 54.598150
> log(a)
[1] 0.0000000 0.6931472 1.0986123 1.3862944
> exp(log(a))
[1] 1 2 3 4

By combining operations and using parentheses you can make more complicated expressions:

> c <- (a + sqrt(a))/(exp(2)+1)
> c
[1] 0.2384058 0.4069842 0.5640743 0.7152175

Note that you can do the same operations with vector arguments. For example to add the elements in vector a to the elements in vector b use the following command:

> a + b
[1] -8 -6 -4 -2

The operation is performed on an element by element basis. Note this is true for almost all of the basic functions. So you can bring together all kinds of complicated expressions:

> a*b
[1]  -9 -16 -21 -24
> a/b
[1] -0.1111111 -0.2500000 -0.4285714 -0.6666667
> (a+3)/(sqrt(1-b)*2-1)
[1] 0.7512364 1.0000000 1.2884234 1.6311303

You need to be careful of one thing. When you do operations on vectors they are performed on an element by element basis. One ramification of this is that all of the vectors in an expression must be the same length. If the lengths of the vectors differ then you may get an error message, or worse, a warning message and unpredictable results:

> a <- c(1,2,3)
> b <- c(10,11,12,13)
> a+b
[1] 11 13 15 14
Warning message:
longer object length
        is not a multiple of shorter object length in: a + b

As you work in R and create new vectors it can be easy to lose track of what variables you have defined. To get a list of all of the variables that have been defined use the ls() command:

> ls()
[1] "a"            "b"            "bubba"        "c"            "last.warning"
[6] "tree"         "trees"

Finally, you should keep in mind that the basic operations almost always work on an element by element basis. There are rare exceptions to this general rule. For example, if you look at the minimum of two vectors using the min command you will get the minimum of all of the numbers. There is a special command, called pmin, that may be the command you want in some circumstances:

> a <- c(1,-2,3,-4)
> b <- c(-1,2,-3,4)
> min(a,b)
[1] -4
> pmin(a,b)
[1] -1 -2 -3 -4

3.2. Basic Numerical Descriptions

Given a vector of numbers there are some basic commands to make it easier to get some of the basic numerical descriptions of a set of numbers. Here we assume that you can read in the tree data that was discussed in a previous chapter. It is assumed that it is stored in a variable called tree:

> tree <- read.csv(file="trees91.csv",header=TRUE,sep=",");
> names(tree)
 [1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC"
 [9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC"
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC"

Each column in the data frame can be accessed as a vector. For example the numbers associated with the leaf biomass (LFBM) can be found using tree$LFBM:

> tree$LFBM
 [1] 0.430 0.400 0.450 0.820 0.520 1.320 0.900 1.180 0.480 0.210 0.270 0.310
[13] 0.650 0.180 0.520 0.300 0.580 0.480 0.580 0.580 0.410 0.480 1.760 1.210
[25] 1.180 0.830 1.220 0.770 1.020 0.130 0.680 0.610 0.700 0.820 0.760 0.770
[37] 1.690 1.480 0.740 1.240 1.120 0.750 0.390 0.870 0.410 0.560 0.550 0.670
[49] 1.260 0.965 0.840 0.970 1.070 1.220

The following commands can be used to get the mean, median, quantiles, minimum, maximum, variance, and standard deviation of a set of numbers:

> mean(tree$LFBM)
[1] 0.7649074
> median(tree$LFBM)
[1] 0.72
> quantile(tree$LFBM)
    0%    25%    50%    75%   100%
0.1300 0.4800 0.7200 1.0075 1.7600
> min(tree$LFBM)
[1] 0.13
> max(tree$LFBM)
[1] 1.76
> var(tree$LFBM)
[1] 0.1429382
> sd(tree$LFBM)
[1] 0.3780717

Finally, the summary command will print out the min, max, mean, median, and quantiles:

> summary(tree$LFBM)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1300  0.4800  0.7200  0.7649  1.0080  1.7600

The summary command is especially nice because if you give it a data frame it will print out the summary for every vector in the data frame:

> summary(tree)
       C               N              CHBR         REP             LFBM
 Min.   :1.000   Min.   :1.000   A1     : 3   Min.   : 1.00   Min.   :0.1300
 1st Qu.:2.000   1st Qu.:1.000   A4     : 3   1st Qu.: 9.00   1st Qu.:0.4800
 Median :2.000   Median :2.000   A6     : 3   Median :14.00   Median :0.7200
 Mean   :2.519   Mean   :1.926   B2     : 3   Mean   :13.05   Mean   :0.7649
 3rd Qu.:3.000   3rd Qu.:3.000   B3     : 3   3rd Qu.:20.00   3rd Qu.:1.0075
 Max.   :4.000   Max.   :3.000   B4     : 3   Max.   :20.00   Max.   :1.7600
                                 (Other):36   NA's   :11.00
      STBM             RTBM            LFNCC           STNCC
 Min.   :0.0300   Min.   :0.1200   Min.   :0.880   Min.   :0.3700
 1st Qu.:0.1900   1st Qu.:0.2825   1st Qu.:1.312   1st Qu.:0.6400
 Median :0.2450   Median :0.4450   Median :1.550   Median :0.7850
 Mean   :0.2883   Mean   :0.4662   Mean   :1.560   Mean   :0.7872
 3rd Qu.:0.3800   3rd Qu.:0.5500   3rd Qu.:1.788   3rd Qu.:0.9350
 Max.   :0.7200   Max.   :1.5100   Max.   :2.760   Max.   :1.2900

     RTNCC            LFBCC           STBCC           RTBCC
 Min.   :0.4700   Min.   :25.00   Min.   :14.00   Min.   :15.00
 1st Qu.:0.6000   1st Qu.:34.00   1st Qu.:17.00   1st Qu.:19.00
 Median :0.7500   Median :37.00   Median :18.00   Median :20.00
 Mean   :0.7394   Mean   :36.96   Mean   :18.80   Mean   :21.43
 3rd Qu.:0.8100   3rd Qu.:41.00   3rd Qu.:20.00   3rd Qu.:23.00
 Max.   :1.5500   Max.   :48.00   Max.   :27.00   Max.   :41.00

     LFCACC           STCACC           RTCACC           LFKCC
 Min.   :0.2100   Min.   :0.1300   Min.   :0.1100   Min.   :0.6500
 1st Qu.:0.2600   1st Qu.:0.1600   1st Qu.:0.1600   1st Qu.:0.8100
 Median :0.2900   Median :0.1700   Median :0.1650   Median :0.9000
 Mean   :0.2869   Mean   :0.1774   Mean   :0.1654   Mean   :0.9053
 3rd Qu.:0.3100   3rd Qu.:0.1875   3rd Qu.:0.1700   3rd Qu.:0.9900
 Max.   :0.3600   Max.   :0.2400   Max.   :0.2400   Max.   :1.1800
                                                    NA's   :1.0000
     STKCC           RTKCC           LFMGCC           STMGCC
 Min.   :0.870   Min.   :0.330   Min.   :0.0700   Min.   :0.100
 1st Qu.:0.940   1st Qu.:0.400   1st Qu.:0.1000   1st Qu.:0.110
 Median :1.055   Median :0.475   Median :0.1200   Median :0.130
 Mean   :1.105   Mean   :0.473   Mean   :0.1109   Mean   :0.135
 3rd Qu.:1.210   3rd Qu.:0.520   3rd Qu.:0.1300   3rd Qu.:0.150
 Max.   :1.520   Max.   :0.640   Max.   :0.1400   Max.   :0.190

     RTMGCC            LFPCC            STPCC            RTPCC
 Min.   :0.04000   Min.   :0.1500   Min.   :0.1500   Min.   :0.1000
 1st Qu.:0.06000   1st Qu.:0.2000   1st Qu.:0.2200   1st Qu.:0.1300
 Median :0.07000   Median :0.2400   Median :0.2800   Median :0.1450
 Mean   :0.06648   Mean   :0.2381   Mean   :0.2707   Mean   :0.1465
 3rd Qu.:0.07000   3rd Qu.:0.2700   3rd Qu.:0.3175   3rd Qu.:0.1600
 Max.   :0.09000   Max.   :0.3100   Max.   :0.4100   Max.   :0.2100

     LFSCC            STSCC            RTSCC
 Min.   :0.0900   Min.   :0.1400   Min.   :0.0900
 1st Qu.:0.1325   1st Qu.:0.1600   1st Qu.:0.1200
 Median :0.1600   Median :0.1800   Median :0.1300
 Mean   :0.1661   Mean   :0.1817   Mean   :0.1298
 3rd Qu.:0.1875   3rd Qu.:0.2000   3rd Qu.:0.1475
 Max.   :0.2600   Max.   :0.2800   Max.   :0.1700

3.3. Operations on Vectors

Here we look at some commonly used commands that perform operations on lists. The commands include the sort, min, max, and sum commands. First, the sort command can sort the given vector in either ascending or descending order:

> a = c(2,4,6,3,1,5)
> b = sort(a)
> c = sort(a,decreasing = TRUE)
> a
[1] 2 4 6 3 1 5
> b
[1] 1 2 3 4 5 6
> c
[1] 6 5 4 3 2 1

The min and the max commands find the minimum and the maximum numbers in the vector:

> min(a)
[1] 1
> max(a)
[1] 6

Finally, the sum command adds up the numbers in the vector:

> sum(a)
[1] 21

4. Basic Probability Distributions


We look at some of the basic operations associated with probability distributions. There are a large number of probability distributions available, but we only look at a few. If you would like to know what distributions are available you can do a search using the command help.search(“distribution”).

Here we give details about the commands associated with the normal distribution and briefly mention the commands for other distributions. The functions for the different distributions are very similar, and the differences are noted below.

For this chapter it is assumed that you know how to enter data which is covered in the previous chapters.

To get a full list of the distributions available in R you can use the following command:

help(Distributions)

For every distribution there are four commands. The commands for each distribution are prepended with a letter to indicate the functionality:

“d” returns the height of the probability density function
“p” returns the cumulative distribution function
“q” returns the inverse cumulative distribution function (quantiles)
“r” returns randomly generated numbers
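
As a brief illustration of this naming convention, here are the four commands for the uniform distribution on [0,1], a distribution not otherwise covered in this chapter. The first command gives the height of the density at 0.5, the second the probability of a value below 0.5, the third the value whose cumulative probability is 0.25, and the last gives three random numbers (which will differ every time it is called):

> dunif(0.5)
[1] 1
> punif(0.5)
[1] 0.5
> qunif(0.25)
[1] 0.25
> runif(3)
[1] 0.2875775 0.7883051 0.4089769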

4.1. The Normal Distribution

There are four functions that can be used to generate the values associated with the normal distribution. You can get a full list of them and their options using the help command:

> help(Normal)

The first function we look at is dnorm. Given a set of values it returns the height of the probability density function at each point. If you only give the points it assumes you want to use a mean of zero and standard deviation of one. There are options to use different values for the mean and standard deviation, though:

> dnorm(0)
[1] 0.3989423
> dnorm(0)*sqrt(2*pi)
[1] 1
> dnorm(0,mean=4)
[1] 0.0001338302
> dnorm(0,mean=4,sd=10)
[1] 0.03682701
> v <- c(0,1,2)
> dnorm(v)
[1] 0.39894228 0.24197072 0.05399097
> x <- seq(-20,20,by=.1)
> y <- dnorm(x)
> plot(x,y)
> y <- dnorm(x,mean=2.5,sd=0.1)
> plot(x,y)

The second function we examine is pnorm. Given a number or a list it computes the probability that a normally distributed random number will be less than that number. This function also goes by the rather ominous title of the “Cumulative Distribution Function.” It accepts the same options as dnorm:

> pnorm(0)
[1] 0.5
> pnorm(1)
[1] 0.8413447
> pnorm(0,mean=2)
[1] 0.02275013
> pnorm(0,mean=2,sd=3)
[1] 0.2524925
> v <- c(0,1,2)
> pnorm(v)
[1] 0.5000000 0.8413447 0.9772499
> x <- seq(-20,20,by=.1)
> y <- pnorm(x)
> plot(x,y)
> y <- pnorm(x,mean=3,sd=4)
> plot(x,y)

If you wish to find the probability that a number is larger than the given number you can use the lower.tail option:

> pnorm(0,lower.tail=FALSE)
[1] 0.5
> pnorm(1,lower.tail=FALSE)
[1] 0.1586553
> pnorm(0,mean=2,lower.tail=FALSE)
[1] 0.9772499

The next function we look at is qnorm which is the inverse of pnorm. The idea behind qnorm is that you give it a probability, and it returns the number whose cumulative distribution matches the probability. For example, if you have a normally distributed random variable with mean zero and standard deviation one, then if you give the function a probability it returns the associated Z-score:

> qnorm(0.5)
[1] 0
> qnorm(0.5,mean=1)
[1] 1
> qnorm(0.5,mean=1,sd=2)
[1] 1
> qnorm(0.5,mean=2,sd=2)
[1] 2
> qnorm(0.5,mean=2,sd=4)
[1] 2
> qnorm(0.25,mean=2,sd=2)
[1] 0.6510205
> qnorm(0.333)
[1] -0.4316442
> qnorm(0.333,sd=3)
[1] -1.294933
> qnorm(0.75,mean=5,sd=2)
[1] 6.34898
> v = c(0.1,0.3,0.75)
> qnorm(v)
[1] -1.2815516 -0.5244005  0.6744898
> x <- seq(0,1,by=.05)
> y <- qnorm(x)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=2)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=0.1)
> plot(x,y)
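
A quick way to convince yourself that qnorm and pnorm undo one another is to compose them:

> pnorm(qnorm(0.2))
[1] 0.2
> qnorm(pnorm(1.5))
[1] 1.5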

The last function we examine is the rnorm function which can generate random numbers whose distribution is normal. The argument that you give it is the number of random numbers that you want, and it has optional arguments to specify the mean and standard deviation:

> rnorm(4)
[1]  1.2387271 -0.2323259 -1.2003081 -1.6718483
> rnorm(4,mean=3)
[1] 2.633080 3.617486 2.038861 2.601933
> rnorm(4,mean=3,sd=3)
[1] 4.580556 2.974903 4.756097 6.395894
> rnorm(4,mean=3,sd=3)
[1]  3.000852  3.714180 10.032021  3.295667
> y <- rnorm(200)
> hist(y)
> y <- rnorm(200,mean=-2)
> hist(y)
> y <- rnorm(200,mean=-2,sd=4)
> hist(y)
> qqnorm(y)
> qqline(y)

4.2. The t Distribution

There are four functions that can be used to generate the values associated with the t distribution. You can get a full list of them and their options using the help command:

> help(TDist)

These commands work just like the commands for the normal distribution. One difference is that the commands assume that the values are normalized to mean zero and standard deviation one, so you have to use a little algebra to use these functions in practice. The other difference is that you have to specify the number of degrees of freedom. The commands follow the same kind of naming convention, and the names of the commands are dt, pt, qt, and rt.

A few examples are given below to show how to use the different commands. First we have the probability density function, dt:

> x <- seq(-20,20,by=.5)
> y <- dt(x,df=10)
> plot(x,y)
> y <- dt(x,df=50)
> plot(x,y)

Next we have the cumulative probability distribution function:

> pt(-3,df=10)
[1] 0.006671828
> pt(3,df=10)
[1] 0.9933282
> 1-pt(3,df=10)
[1] 0.006671828
> pt(3,df=20)
[1] 0.996462
> x = c(-3,-4,-2,-1)
> pt((mean(x)-2)/sd(x),df=20)
[1] 0.001165548
> pt((mean(x)-2)/sd(x),df=40)
[1] 0.000603064

Next we have the inverse cumulative probability distribution function:

> qt(0.05,df=10)
[1] -1.812461
> qt(0.95,df=10)
[1] 1.812461
> qt(0.05,df=20)
[1] -1.724718
> qt(0.95,df=20)
[1] 1.724718
> v <- c(0.005,.025,.05)
> qt(v,df=253)
[1] -2.595401 -1.969385 -1.650899
> qt(v,df=25)
[1] -2.787436 -2.059539 -1.708141

Finally random numbers can be generated according to the t distribution:

> rt(3,df=10)
[1] 0.9440930 2.1734365 0.6785262
> rt(3,df=20)
[1]  0.1043300 -1.4682198  0.0715013
> rt(3,df=20)
[1]  0.8023832 -0.4759780 -1.0546125

4.3. The Binomial Distribution

There are four functions that can be used to generate the values associated with the binomial distribution. You can get a full list of them and their options using the help command:

> help(Binomial)

These commands work just like the commands for the normal distribution. The binomial distribution requires two extra parameters, the number of trials and the probability of success for a single trial. The commands follow the same kind of naming convention, and the names of the commands are dbinom, pbinom, qbinom, and rbinom.

A few examples are given below to show how to use the different commands. First we have the probability mass function, dbinom:

> x <- seq(0,50,by=1)
> y <- dbinom(x,50,0.2)
> plot(x,y)
> y <- dbinom(x,50,0.6)
> plot(x,y)
> x <- seq(0,100,by=1)
> y <- dbinom(x,100,0.6)
> plot(x,y)

Next we have the cumulative probability distribution function:

> pbinom(24,50,0.5)
[1] 0.4438624
> pbinom(25,50,0.5)
[1] 0.5561376
> pbinom(25,51,0.5)
[1] 0.5
> pbinom(26,51,0.5)
[1] 0.610116
> pbinom(25,50,0.5)
[1] 0.5561376
> pbinom(25,50,0.25)
[1] 0.999962
> pbinom(25,500,0.25)
[1] 4.955658e-33

Next we have the inverse cumulative probability distribution function:

> qbinom(0.5,51,1/2)
[1] 25
> qbinom(0.25,51,1/2)
[1] 23
> pbinom(23,51,1/2)
[1] 0.2879247
> pbinom(22,51,1/2)
[1] 0.200531

Finally random numbers can be generated according to the binomial distribution:

> rbinom(5,100,.2)
[1] 30 23 21 19 18
> rbinom(5,100,.7)
[1] 66 66 58 68 63

4.4. The Chi-Squared Distribution

There are four functions that can be used to generate the values associated with the Chi-Squared distribution. You can get a full list of them and their options using the help command:

> help(Chisquare)

These commands work just like the commands for the normal distribution. The first difference is that it is assumed that you have normalized the value so no mean can be specified. The other difference is that you have to specify the number of degrees of freedom. The commands follow the same kind of naming convention, and the names of the commands are dchisq, pchisq, qchisq, and rchisq.

A few examples are given below to show how to use the different commands. First we have the probability density function, dchisq:

> x <- seq(-20,20,by=.5)
> y <- dchisq(x,df=10)
> plot(x,y)
> y <- dchisq(x,df=12)
> plot(x,y)

Next we have the cumulative probability distribution function:

> pchisq(2,df=10)
[1] 0.003659847
> pchisq(3,df=10)
[1] 0.01857594
> 1-pchisq(3,df=10)
[1] 0.981424
> pchisq(3,df=20)
[1] 4.097501e-06
> x = c(2,4,5,6)
> pchisq(x,df=20)
[1] 1.114255e-07 4.649808e-05 2.773521e-04 1.102488e-03

Next we have the inverse cumulative probability distribution function:

> qchisq(0.05,df=10)
[1] 3.940299
> qchisq(0.95,df=10)
[1] 18.30704
> qchisq(0.05,df=20)
[1] 10.85081
> qchisq(0.95,df=20)
[1] 31.41043
> v <- c(0.005,.025,.05)
> qchisq(v,df=253)
[1] 198.8161 210.8355 217.1713
> qchisq(v,df=25)
[1] 10.51965 13.11972 14.61141

Finally random numbers can be generated according to the Chi-Squared distribution:

> rchisq(3,df=10)
[1] 16.80075 20.28412 12.39099
> rchisq(3,df=20)
[1] 17.838878  8.591936 17.486372
> rchisq(3,df=20)
[1] 11.19279 23.86907 24.81251

5. Basic Plots


This is a brief introduction to some of the basic plotting commands.

w1.dat and trees91.csv are read using “read.csv” into variables w1 and tree:

>  w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
> names(w1)
[1] "vals"
>  tree <- read.csv(file="trees91.csv",sep=",",head=TRUE)
> names(tree)
 [1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC"
 [9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC"
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC"

5.1. Strip Charts

A strip chart is the most basic type of plot available. It plots the data in order along a line with each data point represented as a box.

To create a strip chart of this data use the stripchart command:

> help(stripchart)
> stripchart(w1$vals)
[Figure: stripchart.png]

There are no titles or axis labels. It only shows how the data looks if you were to put it all along one line and mark out a box at each point.
If you would prefer to see which points are repeated you can specify that repeated points be stacked:

> stripchart(w1$vals,method="stack")

A variation on this is to have the boxes moved up and down so that there is more separation between them:

> stripchart(w1$vals,method="jitter")

If you do not want the boxes plotted in the horizontal direction you can plot them in the vertical direction:

> stripchart(w1$vals,vertical=TRUE)
> stripchart(w1$vals,vertical=TRUE,method="jitter")

Since you should always annotate your plots there are many different ways to add titles and labels. One way is within the stripchart command itself:

> stripchart(w1$vals,method="stack",
             main='Leaf BioMass in High CO2 Environment',
             xlab='BioMass of Leaves')

If you have a plot already and want to add a title, you can use the title command:

> title('Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves')

Note that this simply adds the title and labels and will write over the top of any titles or labels you already have.

5.2. Histograms

A histogram plots the frequencies that data appears within certain ranges.

To plot a histogram of the data use the “hist” command:

> hist(w1$vals)
> hist(w1$vals,main="Distribution of w1",xlab="w1")
[Figure: histogramW1.png]

R will automatically calculate the intervals to use. There are many options to determine how to break up the intervals.

You can specify the number of breaks to use using the breaks option. Here we look at the histogram for various numbers of breaks:

> hist(w1$vals,breaks=2)
> hist(w1$vals,breaks=4)
> hist(w1$vals,breaks=6)
> hist(w1$vals,breaks=8)
> hist(w1$vals,breaks=12)
>

You can also vary the size of the domain using the xlim option. This option takes a vector with two entries in it, the left value and the right value:

> hist(w1$vals,breaks=12,xlim=c(0,10))
> hist(w1$vals,breaks=12,xlim=c(-1,2))
> hist(w1$vals,breaks=12,xlim=c(0,2))
> hist(w1$vals,breaks=12,xlim=c(1,1.3))
> hist(w1$vals,breaks=12,xlim=c(0.9,1.3))
>

The options for adding titles and labels are exactly the same as for strip charts. You should always annotate your plots and there are many different ways to add titles and labels. One way is within the hist command itself:

> hist(w1$vals,
       main='Leaf BioMass in High CO2 Environment',
       xlab='BioMass of Leaves')

If you have a plot already and want to change or add a title, you can use the title command:

> title('Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves')

Note that this simply adds the title and labels and will write over the top of any titles or labels you already have.

It is not uncommon to add other kinds of plots to a histogram. For example, one of the options to the stripchart command is to add it to a plot that has already been drawn. For example, you might want to have a histogram with the strip chart drawn across the top. The addition of the strip chart might give you a better idea of the density of the data:

> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
> stripchart(w1$vals,add=TRUE,at=15.5)

5.3. Boxplots

A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set. Here we provide examples using two different data sets. The first is the w1 data frame mentioned at the top of this page, and the one column of data is w1$vals. The second is the tree data frame from the trees91.csv data file which is also mentioned at the top of the page.

We first use the w1 data set and look at the boxplot of this data set:

> boxplot(w1$vals)
[Figure: simpleBoxPlot.png]

Again, this is a very plain graph, and the title and labels can be specified in exactly the same way as in the stripchart and hist commands:

> boxplot(w1$vals,
          main='Leaf BioMass in High CO2 Environment',
          ylab='BioMass of Leaves')

Note that the default orientation is to plot the boxplot vertically. Because of this we used the ylab option to specify the axis label. There are a large number of options for this command. To see more of the options see the help page:

> help(boxplot)

As an example you can specify that the boxplot be plotted horizontally by specifying the horizontal option:

> boxplot(w1$vals,
          main='Leaf BioMass in High CO2 Environment',
          xlab='BioMass of Leaves',
          horizontal=TRUE)

The option to plot the box plot horizontally can be put to good use to display a box plot on the same image as a histogram. You need to specify the add option, specify where to put the box plot using the at option, and turn off the addition of axes using the axes option:

> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
> boxplot(w1$vals,horizontal=TRUE,at=15.5,add=TRUE,axes=FALSE)

If you are feeling really crazy you can take a histogram and add a box plot and a strip chart:

> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
> boxplot(w1$vals,horizontal=TRUE,at=16,add=TRUE,axes=FALSE)
> stripchart(w1$vals,add=TRUE,at=15)

Some people shell out good money to have this much fun.

For the second part on boxplots we will look at the second data frame, “tree,” which comes from the trees91.csv file. To reiterate the discussion at the top of this page and the discussion in the data types chapter, we need to specify which columns are factors:

> tree <- read.csv(file="trees91.csv",sep=",",head=TRUE)
> tree$C <- factor(tree$C)
> tree$N <- factor(tree$N)

We can look at the boxplot of just the data for the stem biomass:

> boxplot(tree$STBM,
          main='Stem BioMass in Different CO2 Environments',
          ylab='BioMass of Stems')

That plot does not tell the whole story. It is for all of the trees, but the trees were grown in different kinds of environments. The boxplot command can be used to plot a separate box plot for each level. In this case the data is held in “tree$STBM,” and the different levels are stored as factors in “tree$C.” The command to create different boxplots is the following:

boxplot(tree$STBM~tree$C)

Note that for the level called “2” there are four outliers which are plotted as little circles. There are many options to annotate your plot including different labels for each level. Please use the help(boxplot) command for more information.
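
For example, a title and axis labels can be added in the same call, and the names option can be used to relabel the levels (the labels below are made up purely for illustration):

> boxplot(tree$STBM~tree$C,
          main='Stem BioMass in Different CO2 Environments',
          xlab='Carbon Dioxide Environment',
          ylab='BioMass of Stems',
          names=c('Env 1','Env 2','Env 3','Env 4'))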

5.4. Scatter Plots

A scatter plot provides a graphical view of the relationship between two sets of numbers. Here we provide examples using the tree data frame from the trees91.csv data file which is mentioned at the top of the page. In particular we look at the relationship between the stem biomass (“tree$STBM”) and the leaf biomass (“tree$LFBM”).

The command to plot each pair of points as an x-coordinate and a y-coordinate is “plot:”

> plot(tree$STBM,tree$LFBM)

It appears that there is a strong positive association between the biomass in the stems of a tree and the leaves of the tree. It appears to be a linear relationship. In fact, the correlation between these two sets of observations is quite high:

> cor(tree$STBM,tree$LFBM)
[1] 0.911595

Getting back to the plot, you should always annotate your graphs. The title and labels can be specified in exactly the same way as with the other plotting commands:

> plot(tree$STBM,tree$LFBM,
       main="Relationship Between Stem and Leaf Biomass",
       xlab="Stem Biomass",
       ylab="Leaf Biomass")

5.5. Normal QQ Plots

The final type of plot that we look at is the normal quantile plot. This plot is used to determine if your data is close to being normally distributed. You cannot be sure that the data is normally distributed, but you can rule it out if it is clearly not normally distributed. Here we provide examples using the w1 data frame mentioned at the top of this page, and the one column of data is w1$vals.

The command to generate a normal quantile plot is qqnorm. You can give it one argument, the univariate data set of interest:

> qqnorm(w1$vals)

You can annotate the plot in exactly the same way as all of the other plotting commands given here:

> qqnorm(w1$vals,
         main="Normal Q-Q Plot of the Leaf Biomass",
         xlab="Theoretical Quantiles of the Leaf Biomass",
         ylab="Sample Quantiles of the Leaf Biomass")

After you create the normal quantile plot you can also add the theoretical line that the data should fall on if they were normally distributed:

> qqline(w1$vals)

In this example you should see that the data is not quite normally distributed. There are a few outliers, and it does not match up at the tails of the distribution.

6. Intermediate Plotting

Contents

  • Continuous Data
  • Discrete Data
  • Miscellaneous Options

We look at some more options for plotting, and we assume that you are familiar with the basic plotting commands (Basic Plots). A variety of different subjects ranging from plotting options to the formatting of plots is given.

In many of the examples below we use some of R’s commands to generate random numbers according to various distributions. The chapter is divided into three sections. The focus of the first section is on graphing continuous data. The focus of the second section is on graphing discrete data. The third section offers some miscellaneous options that are useful in a variety of contexts.

    6.1. Continuous Data


    In the examples below a data set is defined using R’s normally distributed random number generator.

    > x <- rnorm(10,sd=5,mean=20)
    > y <- 2.5*x - 1.0 + rnorm(10,sd=9,mean=0)
    > cor(x,y)
    [1] 0.7400576
    

    6.1.1. Multiple Data Sets on One Plot

    One common task is to plot multiple data sets on the same plot. In many situations the way to do this is to create the initial plot and then add additional information to the plot. For example, to plot bivariate data the plot command is used to initialize and create the plot. The points command can then be used to add additional data sets to the plot.

    First define a set of normally distributed random numbers and then plot them. (This same data set is used throughout the examples below.)

    > x <- rnorm(10,sd=5,mean=20)
    > y <- 2.5*x - 1.0 + rnorm(10,sd=9,mean=0)
    > cor(x,y)
    [1] 0.7400576
    > plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
    > x1 <- runif(8,15,25)
    > y1 <- 2.5*x1 - 1.0 + runif(8,-6,6)
    > points(x1,y1,col=2)
    

    Note that in the previous example, the colour for the second set of data points is set using the col option. You can try different numbers to see what colours are available. For most installations there are at least eight options from 1 to 8. Also note that in the example above the points are plotted as circles. The symbol that is used can be changed using the pch option.

    > x2 <- runif(8,15,25)
    > y2 <- 2.5*x2 - 1.0 + runif(8,-6,6)
    > points(x2,y2,col=3,pch=2)
    

    Again, try different numbers to see the various options. Another helpful option is to add a legend. This can be done with the legend command. The options for the command, in order, are the x and y coordinates on the plot to place the legend followed by a list of labels to use. There are a large number of other options so use help(legend) to see more options. For example a list of colors can be given with the col option, and a list of symbols can be given with the pch option.

    > plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
    > points(x1,y1,col=2,pch=3)
    > points(x2,y2,col=4,pch=5)
    > legend(14,70,c("Original","one","two"),col=c(1,2,4),pch=c(1,3,5))
    
    Figure 1. The three data sets displayed on the same graph.

    Another common task is to change the limits of the axes to change the size of the plotting area. This is achieved using the xlim and ylim options in the plot command. Both options take a vector of length two that have the minimum and maximum values.

    > plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff",xlim=c(0,30),ylim=c(0,100))
    > points(x1,y1,col=2,pch=3)
    > points(x2,y2,col=4,pch=5)
    > legend(14,70,c("Original","one","two"),col=c(1,2,4),pch=c(1,3,5))
    

    6.1.2. Error Bars

    Another common task is to add error bars to a set of data points. This can be accomplished using the arrows command. The arrows command takes two pairs of coordinates, that is two pairs of x and y values. The command then draws a line between each pair and adds an “arrow head” with a given length and angle.

    > plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
    > xHigh <- x
    > yHigh <- y + abs(rnorm(10,sd=3.5))
    > xLow <- x
    > yLow <- y - abs(rnorm(10,sd=3.1))
    > arrows(xHigh,yHigh,xLow,yLow,col=2,angle=90,length=0.1,code=3)
    
    Figure 2. A data set with error bars added.

    Note that the option code is used to specify where the bars are drawn. Its value can be 1, 2, or 3. If code is 1 the bars are drawn at pairs given in the first argument. If code is 2 the bars are drawn at the pairs given in the second argument. If code is 3 the bars are drawn at both.
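
    For example, reusing the points defined above, with code=1 the heads are drawn only at the (xHigh, yHigh) ends and with code=2 only at the (xLow, yLow) ends (a quick sketch of the two variants):

    > arrows(xHigh,yHigh,xLow,yLow,col=4,angle=90,length=0.1,code=1)
    > arrows(xHigh,yHigh,xLow,yLow,col=3,angle=90,length=0.1,code=2)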

    6.1.3. Adding Noise (jitter)

    In the previous example a little bit of “noise” was added to the pairs to produce an artificial offset. This is a common thing to do for making plots. A simpler way to accomplish this is to use the jitter command.

    > numberWhite <- rhyper(400,4,5,3)
    > numberChipped <- rhyper(400,2,7,3)
    > par(mfrow=c(1,2))
    > plot(numberWhite,numberChipped,xlab="Number White Marbles Drawn",
           ylab="Number Chipped Marbles Drawn",main="Pulling Marbles")
    > plot(jitter(numberWhite),jitter(numberChipped),xlab="Number White Marbles Drawn",
           ylab="Number Chipped Marbles Drawn",main="Pulling Marbles With Jitter")
    
    Figure 3. Points with noise added using the jitter command.

    6.1.4. Multiple Graphs on One Image

    Note that a new command was used in the previous example. The par command can be used to set different parameters. In the example above the mfrow was set. The plots are arranged in an array where the default number of rows and columns is one. The mfrow parameter is a vector with two entries. The first entry is the number of rows of images. The second entry is the number of columns. In the example above the plots were arranged in one row with two plots across.

    > par(mfrow=c(2,3))
    > boxplot(numberWhite,main="first plot")
    > boxplot(numberChipped,main="second plot")
    > plot(jitter(numberWhite),jitter(numberChipped),xlab="Number White Marbles Drawn",
           ylab="Number Chipped Marbles Drawn",main="Pulling Marbles With Jitter")
    > hist(numberWhite,main="fourth plot")
    > hist(numberChipped,main="fifth plot")
    > mosaicplot(table(numberWhite,numberChipped),main="sixth plot")
    
    Figure 4. An array of plots using the par command.
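
    After producing an array of plots it can be handy to restore the default layout of a single plot per figure. A minimal sketch:

    > par(mfrow=c(1,1))
    >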

    6.1.5. Density Plots

    There are times when you do not want to plot specific points but wish to plot a density. This can be done using the smoothScatter command.

    > numberWhite <- rhyper(30,4,5,3)
    > numberChipped <- rhyper(30,2,7,3)
    > smoothScatter(numberWhite,numberChipped,
                 xlab="White Marbles",ylab="Chipped Marbles",main="Drawing Marbles")
    
    Figure 5. The smoothScatter command can be used to plot densities.

    Note that the previous example may benefit from superimposing a grid to help delimit the points of interest. This can be done using the grid command.

    > numberWhite <- rhyper(30,4,5,3)
    > numberChipped <- rhyper(30,2,7,3)
    > smoothScatter(numberWhite,numberChipped,
                 xlab="White Marbles",ylab="Chipped Marbles",main="Drawing Marbles")
    > grid(4,3)
    

    6.1.6. Pairwise Relationships

    There are times that you want to explore a large number of relationships. A number of relationships can be plotted at one time using the pairs command. The idea is that you give it a matrix or a data frame, and the command will create a scatter plot of all combinations of the data.

    > uData <- rnorm(20)
    > vData <- rnorm(20,mean=5)
    > wData <- uData + 2*vData + rnorm(20,sd=0.5)
    > xData <- -2*uData+rnorm(20,sd=0.1)
    > yData <- 3*vData+rnorm(20,sd=2.5)
    > d <- data.frame(u=uData,v=vData,w=wData,x=xData,y=yData)
    > pairs(d)
    
    Figure 6. Using pairs to produce all permutations of a set of relationships on one graph.

    6.1.7. Shaded Regions

    A shaded region can be plotted using the polygon command. The polygon command takes a pair of vectors, x and y, and shades the region enclosed by the coordinate pairs. In the example below a blue square is drawn. The vertices are defined starting from the lower left. Five pairs of points are given because the starting point and the ending point are the same.

    > x = c(-1,1,1,-1,-1)
    > y = c(-1,-1,1,1,-1)
    > plot(x,y)
    > polygon(x,y,col='blue')
    >
    

    A more complicated example is given below. In this example the rejection region for a right sided hypothesis test is plotted, and it is shaded in red. A set of custom axes is constructed, and symbols are plotted using the expression command.

    > stdDev <- 0.75;
    > x <- seq(-5,5,by=0.01)
    > y <- dnorm(x,sd=stdDev)
    > right <- qnorm(0.95,sd=stdDev)
    > plot(x,y,type="l",xaxt="n",ylab="p",
           xlab=expression(paste('Assumed Distribution of ',bar(x))),
           axes=FALSE,ylim=c(0,max(y)*1.05),xlim=c(min(x),max(x)),
           frame.plot=FALSE)
    > axis(1,at=c(-5,right,0,5),
           pos = c(0,0),
           labels=c(expression(' '),expression(bar(x)[cr]),expression(mu[0]),expression(' ')))
    > axis(2)
    > xReject <- seq(right,5,by=0.01)
    > yReject <- dnorm(xReject,sd=stdDev)
    > polygon(c(xReject,xReject[length(xReject)],xReject[1]),
              c(yReject,0, 0), col='red')
    
    Figure 7. Using polygon to produce a shaded region.

    The axes are drawn separately. This is done by first suppressing them in the plot command and then adding each one with the axis command, with the horizontal axis placed through zero using the pos option. Also note that the expression command is used to plot a Greek character and to produce subscripts.
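
    The expression command is worth knowing on its own, since it can be used anywhere a label is accepted. A minimal sketch; the data and title text here are made up purely for the illustration:

    > plot(1:10,main=expression(paste("Estimate of ",mu[0]," using ",bar(x))))
    >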

    6.1.8. Plotting a Surface

    Finally, a brief example of how to plot a surface is given. The persp command will plot a surface with a specified perspective. In the example, a grid is defined by multiplying a row and column vector to give the x and then the y values for a grid. Once that is done a sine function is specified on the grid, and the persp command is used to plot it.

    > x <- seq(0,2*pi,by=pi/100)
    > y <- x
    > xg <- (x*0+1) %*% t(y)
    > yg <- (x) %*% t(y*0+1)
    > f <- sin(xg+yg)
    > persp(x,y,f,theta=-10,phi=40)
    >
    

    The %*% notation is used to perform matrix multiplication.
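
    As a quick, hedged check of what %*% does, here is a small example multiplying a 2x2 matrix by a vector; the numbers are made up for the illustration:

    > A <- matrix(c(1,2,3,4),nrow=2)
    > A
         [,1] [,2]
    [1,]    1    3
    [2,]    2    4
    > A %*% c(1,1)
         [,1]
    [1,]    4
    [2,]    6
    >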

    6.2. Discrete Data

    In the examples below a data set is defined using R’s hypergeometric random number generator.

    > numberWhite <- rhyper(30,4,5,3)
    > numberChipped <- rhyper(30,2,7,3)
    

    6.2.1. Barplot

    The plot command will try to produce the appropriate plots based on the data type. The data that is defined above, though, is numeric data. You need to convert the data to factors to make sure that the plot command treats it in an appropriate way. The as.factor command is used to cast the data as factors and ensures that R treats it as discrete data.

    > numberWhite <- rhyper(30,4,5,3)
    > numberWhite <- as.factor(numberWhite)
    > plot(numberWhite)
    >
    

    In this case R will produce a barplot. The barplot command can also be used to create a barplot. The barplot command requires a vector of heights, though, and you cannot simply give it the raw data. The frequencies for the barplot command can be easily calculated using the table command.

    > numberWhite <- rhyper(30,4,5,3)
    > totals <- table(numberWhite)
    > totals
    numberWhite
    0  1  2  3
    4 13 11  2
    > barplot(totals,main="Number Draws",ylab="Frequency",xlab="Draws")
    >
    

    In the previous example the barplot command is used to set the title for the plot and the labels for the axes. The labels on the ticks for the horizontal axis are automatically generated using the labels on the table. You can change the labels by setting the row names of the table.

    > totals <- table(numberWhite)
    > rownames(totals) <- c("none","one","two","three")
    > totals
    numberWhite
    none   one   two three
    4      13    11     2
    > barplot(totals,main="Number Draws",ylab="Frequency",xlab="Draws")
    >
    

    The order of the frequencies is the same as the order in the table. If you change the order in the table it will change the way it appears in the barplot. For example, if you wish to arrange the frequencies in descending order you can use the sort command with the decreasing option set to TRUE.

    > barplot(sort(totals,decreasing=TRUE),main="Number Draws",ylab="Frequency",xlab="Draws")
    

    The indexing features of R can be used to change the order of the frequencies manually.

    > totals
    numberWhite
     none   one   two three
       4    13    11     2
    > sort(totals,decreasing=TRUE)
    numberWhite
     one   two  none three
      13    11     4     2
    > totals[c(3,1,4,2)]
    numberWhite
     two  none three   one
      11     4     2    13
    > barplot(totals[c(3,1,4,2)])
    >
    

    The barplot command returns the horizontal locations of the bars. Using the locations and putting together the previous ideas a Pareto Chart can be constructed.

    > xLoc = barplot(sort(totals,decreasing=TRUE),main="Number Draws",
               ylab="Frequency",xlab="Draws",ylim=c(0,sum(totals)+2))
    > points(xLoc,cumsum(sort(totals,decreasing=TRUE)),type='p',col=2)
    > points(xLoc,cumsum(sort(totals,decreasing=TRUE)),type='l')
    >
    

    6.2.2. Mosaic Plot

    Mosaic plots are used to display proportions for tables that are divided into two or more conditional distributions. Here we focus on two way tables to keep things simpler. It is assumed that you are familiar with using tables in R (see the section on two way tables for more information: Two Way Tables).

    Here we will use a made up data set primarily to make it easier to figure out what R is doing. The fictitious data set is defined below. The idea is that sixteen children of age eight are interviewed. They are asked two questions. The first question is, “Do you believe in Santa Claus?” If they say that they do then the term “belief” is recorded, otherwise the term “no belief” is recorded. The second question is whether or not they have an older brother, older sister, or no older sibling. (We are keeping it simple here!) The answers that are recorded are “older brother,” “older sister,” or “no older sibling.”

    > santa <- data.frame(belief=c('no belief','no belief','no belief','no belief',
                                   'belief','belief','belief','belief',
                                   'belief','belief','no belief','no belief',
                                   'belief','belief','no belief','no belief'),
                          sibling=c('older brother','older brother','older brother','older sister',
                                    'no older sibling','no older sibling','no older sibling','older sister',
                                    'older brother','older sister','older brother','older sister',
                                    'no older sibling','older sister','older brother','no older sibling')
                          )
     > santa
           belief          sibling
     1  no belief    older brother
     2  no belief    older brother
     3  no belief    older brother
     4  no belief     older sister
     5     belief no older sibling
     6     belief no older sibling
     7     belief no older sibling
     8     belief     older sister
     9     belief    older brother
     10    belief     older sister
     11 no belief    older brother
     12 no belief     older sister
     13    belief no older sibling
     14    belief     older sister
     15 no belief    older brother
     16 no belief no older sibling
     >  summary(santa)
       belief              sibling
     belief   :8   no older sibling:5
     no belief:8   older brother   :6
                   older sister    :5
    

    The data is given as strings. In versions of R before 4.0 the data.frame command converts strings to factors automatically; from R 4.0 onward the columns are left as character data unless you ask for factors, for example with stringsAsFactors=TRUE or as.factor. (The summary output above shows the factor behaviour.) Once the columns are factors, plotting the individual variables with the plot command defaults to producing barplots.
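
    A minimal sketch of the explicit conversion, only needed on R 4.0 or later where the columns would otherwise remain character vectors:

    > santa$belief  <- as.factor(santa$belief)
    > santa$sibling <- as.factor(santa$sibling)
    >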

    > plot(santa$belief)
    > plot(santa$sibling)
    >
    

    If you provide both data sets it will automatically produce a mosaic plot which demonstrates the relative frequencies in terms of the resulting areas.

    > plot(santa$sibling,santa$belief)
    > plot(santa$belief,santa$sibling)
    

    The mosaicplot command can be called directly:

    > totals = table(santa$belief,santa$sibling)
    > totals
    
                no older sibling older brother older sister
      belief                   4             1            3
      no belief                1             5            2
    > mosaicplot(totals,main="Older Brothers are Jerks",
                  xlab="Belief in Santa Claus",ylab="Older Sibling")
    

    The colours of the plot can be specified by setting the col argument. The argument is a vector of colours, one for each level of the second variable in the table (the columns). See Figure 8 for an example.

    > mosaicplot(totals,main="Older Brothers are Jerks",
                  xlab="Belief in Santa Claus",ylab="Older Sibling",
                  col=c(2,3,4))
    
    Figure 8. Example of a mosaic plot with colours.

    The labels and the order that they appear in the plot can be changed in exactly the same way as given in the examples for barplot above.

    > rownames(totals)
    [1] "belief"    "no belief"
    > colnames(totals)
    [1] "no older sibling" "older brother"    "older sister"
    > rownames(totals) <- c("Believes","Does not Believe")
    > colnames(totals) <- c("No Older","Older Brother","Older Sister")
    > totals
    
                       No Older Older Brother Older Sister
      Believes                4             1            3
      Does not Believe        1             5            2
    > mosaicplot(totals,main="Older Brothers are Jerks",
                  xlab="Belief in Santa Claus",ylab="Older Sibling")
    

    When changing the order keep in mind that the table is a two dimensional array. The indices must include both rows and columns, and the transpose command (t) can be used to switch how it is plotted with respect to the vertical and horizontal axes.

    > totals
    
                       No Older Older Brother Older Sister
      Believes                4             1            3
      Does not Believe        1             5            2
    > totals[c(2,1),c(2,3,1)]
    
                       Older Brother Older Sister No Older
      Does not Believe             5            2        1
      Believes                     1            3        4
    > mosaicplot(totals[c(2,1),c(2,3,1)],main="Older Brothers are Jerks",
           xlab="Belief in Santa Claus",ylab="Older Sibling",col=c(2,3,4))
    > mosaicplot(t(totals),main="Older Brothers are Jerks",
           ylab="Belief in Santa Claus",xlab="Older Sibling",col=c(2,3))
    

    6.3. Miscellaneous Options

    The previous examples only provide a slight hint at what is possible. Here we give some examples that demonstrate how the different commands can be combined and the options that allow them to be used together.

    6.3.1. Multiple Representations On One Plot

    First, an example of a histogram with an approximation of the density function is given. In addition to the density function a horizontal boxplot is added to the plot, with a rug representation of the data on the horizontal axis. The vertical bounds on the histogram are specified so that there is room for the boxplot, which must be added to the existing histogram and is raised above it.

    > x = rexp(20,rate=4)
    > hist(x,ylim=c(0,18),main="This Are An Histogram",xlab="X")
    > boxplot(x,at=16,horizontal=TRUE,add=TRUE)
    > rug(x,side=1)
    > d = density(x)
    > points(d,type='l',col=3)
    >
    

    6.3.2. Multiple Windows

    The dev commands allow you to create and manipulate multiple graphics windows. You can create new windows using the dev.new() command, and you can choose which one to make active using the dev.set() command. The dev.list() command lists the graphical devices that are available, and the dev.next() and dev.prev() commands can be used to step through them.

    In the following example three devices are created. They are listed, and different plots are created on the different devices.

    > dev.new()
    > dev.new()
    > dev.new()
    > dev.list()
    X11cairo X11cairo X11cairo
           2        3        4
    > dev.set(3)
    X11cairo
           3
    > x = rnorm(20)
    > hist(x)
    > dev.set(2)
    X11cairo
           2
    > boxplot(x)
    > dev.set(4)
    X11cairo
           4
    > qqnorm(x)
    > qqline(x)
    > dev.next()
    X11cairo
           2
    > dev.set(dev.next())
    X11cairo
           2
    > plot(density(x))
    >
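
    When you are finished with a graphics window you can close it with the dev.off command. A minimal sketch, assuming the devices created above are still open; dev.off() with no argument closes the currently active device, a device number closes that specific one, and R reports which device becomes active afterwards:

    > dev.off(3)
    >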
    

    6.3.4. Annotation and Formatting

    Basic annotation can be performed in the regular plotting commands. For example, there are options to specify labels on axes as well as titles. More options are available using the axis command.

    Most of the primary plotting commands have an option to turn off the generation of the axes using the axes=FALSE option. The axes can be then added using the axis command which allows for a greater number of options.

    In the example below a bivariate set of random numbers is generated and plotted as a scatter plot. The axes are added, but the horizontal axis is located in the center of the data rather than at the bottom of the figure. Note that the horizontal and vertical axes are added separately and are specified using the first argument to the command. (Use help(axis) for a full list of options.)

    > x <- rnorm(10,mean=0,sd=4)
    > y <- 3*x-1+rnorm(10,mean=0,sd=2)
    > summary(x)
        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     -6.1550 -1.9280  1.2000 -0.1425  2.4780  3.1630
    > summary(y)
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    -17.9800  -9.0060   0.7057  -1.2060   8.2600  10.9200
    > plot(x,y,axes=FALSE,col=2)
    > axis(1,pos=c(0,0),at=seq(-7,5,by=1))
    > axis(2,pos=c(0,0),at=seq(-18,11,by=2))
    >
    

    In the previous example the at option is used to specify the tick marks.

    When using the plot command the default behavior is to draw an axis as well as draw a box around the plotting area. The drawing of the box can be suppressed using the bty option. The value can be “o,” “l,” “7,” “c,” “u”, “],” or “n.” (The lines drawn roughly look like the letter given except for “n” which draws no lines.) The box can be drawn later using the box command as well.

    > x <- rnorm(10,mean=0,sd=4)
    > y <- 3*x-1+rnorm(10,mean=0,sd=2)
    > plot(x,y,bty="7")
    > plot(x,y,bty="n")
    > box(lty=3)
    >
    

    The par command can be used to set the default values for various parameters. A couple are given below. In the example below the default background is set to grey, no box will be drawn around the window, and the margins for the axes will be twice the normal size.

    > par(bty="l")
    > par(bg="gray")
    > par(mex=2)
    > x <- rnorm(10,mean=0,sd=4)
    > y <- 3*x-1+rnorm(10,mean=0,sd=2)
    > plot(x,y)
    >
    

    Another common task is to place a text string on the plot. The text command takes a coordinate and a label, and it places the label at the given coordinate. The text command has options for setting the offset, size, font, and other options. In the example below the label “numbers!” is placed on the plot. Use help(text) to see more options.

    > x <- rnorm(10,mean=0,sd=4)
    > y <- 3*x-1+rnorm(10,mean=0,sd=2)
    > plot(x,y)
    > text(-1,-2,"numbers!")
    >
    

    The default text command will cut off any characters outside of the plot area. This behavior can be overridden using the xpd option.

    > x <- rnorm(10,mean=0,sd=4)
    > y <- 3*x-1+rnorm(10,mean=0,sd=2)
    > plot(x,y)
    > text(-7,-2,"outside the area",xpd=TRUE)
    >
    

    7. Indexing Into Vectors

    Given a vector of data one common task is to isolate particular entries or censor items that meet some criteria. Here we show how to use R’s indexing notation to pick out specific items within a vector.

    7.1. Indexing With Logicals

    We first give an example of how to select specific items in a vector. The first step is to define a vector of data, and the second step is to define a vector made up of logical values. When the vector of logical values is used as the index into the vector of data values, only the items corresponding to the entries that are TRUE are returned:

    > a <- c(1,2,3,4,5)
    > b <- c(TRUE,FALSE,FALSE,TRUE,FALSE)
    > a[b]
    [1] 1 4
    > max(a[b])
    [1] 4
    > sum(a[b])
    [1] 5
    

    7.2. Not Available or Missing Values

    One common problem is data entries that are marked NA or not available. There is a predefined variable called NA that can be used to indicate missing information. The problem with this is that many functions simply return NA (or throw an error) if any of the entries in the data is NA. Some functions allow you to ignore the missing values through special options:

    > a <- c(1,2,3,4,NA)
    > a
    [1]  1  2  3  4 NA
    > sum(a)
    [1] NA
    > sum(a,na.rm=TRUE)
    [1] 10
    

    There are other times, though, when this option is not available, or you simply want to censor them. The is.na function can be used to determine which items are not available. The logical “not” operator in R is the ! symbol. When used with the indexing notation the items within a vector that are NA can be easily removed:

    > a <- c(1,2,3,4,NA)
    > is.na(a)
    [1] FALSE FALSE FALSE FALSE  TRUE
    > !is.na(a)
    [1]  TRUE  TRUE  TRUE  TRUE FALSE
    > a[!is.na(a)]
    [1] 1 2 3 4
    > b <- a[!is.na(a)]
    > b
    [1] 1 2 3 4
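
    R also provides the na.omit function for the same purpose. A minimal sketch; it returns the vector with the missing entries dropped and attaches an attribute recording which positions were removed:

    > na.omit(a)
    [1] 1 2 3 4
    attr(,"na.action")
    [1] 5
    attr(,"class")
    [1] "omit"
    >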
    

    7.3. Indices With Logical Expression

    Any logical expression can be used as an index, which opens a wide range of possibilities. You can remove or focus on entries that match specific criteria. For example, you might want to remove all entries that are above a certain value:

    > a = c(6,2,5,3,8,2)
    > a
    [1] 6 2 5 3 8 2
    > b = a[a<6]
    > b
    [1] 2 5 3 2
    

    For another example, suppose you want to join together the values that match two different factors in another vector:

    > d = data.frame(one=as.factor(c('a','a','b','b','c','c')),
                    two=c(1,2,3,4,5,6))
    > d
      one two
    1   a   1
    2   a   2
    3   b   3
    4   b   4
    5   c   5
    6   c   6
    > both = d$two[(d$one=='a') | (d$one=='b')]
    > both
    [1] 1 2 3 4
    

    Note that a single ‘|’ was used in the previous example. There is a difference between ‘||’ and ‘|’. A single bar will perform a vector operation, term by term, while a double bar will evaluate to a single TRUE or FALSE result:

    > (c(TRUE,TRUE))|(c(FALSE,TRUE))
    [1] TRUE TRUE
    > (c(TRUE,TRUE))||(c(FALSE,TRUE))
    [1] TRUE
    > (c(TRUE,TRUE))&(c(FALSE,TRUE))
    [1] FALSE  TRUE
    > (c(TRUE,TRUE))&&(c(FALSE,TRUE))
    [1] FALSE
    

    8. Linear Least Squares Regression

    Here we look at the most basic linear least squares regression. The main purpose is to provide an example of the basic commands. It is assumed that you know how to enter data or read data files which is covered in the first chapter, and it is assumed that you are familiar with the different data types.

    We will examine the interest rate for four year car loans, and the data that we use comes from the U.S. Federal Reserve’s mean rates . We are looking at and plotting means. This, of course, is a very bad thing because it removes a lot of the variance and is misleading. The only reason that we are working with the data in this way is to provide an example of linear regression that does not use too many data points. Do not try this without a professional near you, and if a professional is not near you do not tell anybody you did this. They will laugh at you. People are mean, especially professionals.

    The first thing to do is to specify the data. Here there are only five pairs of numbers so we can enter them in manually. Each of the five pairs consists of a year and the mean interest rate:

    > year <- c(2000 ,   2001  ,  2002  ,  2003 ,   2004)
    > rate <- c(9.34 ,   8.50  ,  7.62  ,  6.93  ,  6.60)
    

    The next thing we do is take a look at the data. We first plot the data using a scatter plot and notice that it looks linear. To confirm our suspicions we then find the correlation between the year and the mean interest rates:

    > plot(year,rate,
         main="Commercial Banks Interest Rate for 4 Year Car Loan",
         sub="http://www.federalreserve.gov/releases/g19/20050805/")
    > cor(year,rate)
    [1] -0.9880813
    

    At this point we should be excited because associations that strong never happen in the real world unless you cook the books or work with averaged data. The next question is what straight line comes “closest” to the data? In this case we will use least squares regression as one way to determine the line.

    Before we can find the least square regression line we have to make some decisions. First we have to decide which is the explanatory and which is the response variable. Here, we arbitrarily pick the explanatory variable to be the year, and the response variable is the interest rate. This was chosen because it seems like the interest rate might change in time rather than time changing as the interest rate changes. (We could be wrong, finance is very confusing.)

    The command to perform the least square regression is the lm command. The command has many options, but we will keep it simple and not explore them here. If you are interested use the help(lm) command to learn more. The only argument we examine is the one required argument, which specifies the relationship.

    Since we specified that the interest rate is the response variable and the year is the explanatory variable this means that the regression line can be written in slope-intercept form:

    \[rate = (slope) year + (intercept)\]

    The way that this relationship is defined in the lm command is that you write the vector containing the response variable, a tilde (“~”), and a vector containing the explanatory variable:

    > fit <- lm(rate ~ year)
    > fit
    Call:
    lm(formula = rate ~ year)
    
    Coefficients:
    (Intercept)         year
       1419.208       -0.705
    

    When you make the call to lm it returns a variable with a lot of information in it. If you are just learning about least squares regression you are probably only interested in two things at this point, the slope and the y-intercept. If you just type the name of the variable returned by lm it will print out this minimal information to the screen. (See above.)

    If you would like to know what else is stored in the variable you can use the attributes command:

    > attributes(fit)
    $names
     [1] "coefficients"  "residuals"     "effects"       "rank"
     [5] "fitted.values" "assign"        "qr"            "df.residual"
     [9] "xlevels"       "call"          "terms"         "model"
    
    $class
    [1] "lm"
    

    One of the things you should notice is the coefficients variable within fit. You can print out the y-intercept and slope by accessing this part of the variable:

    > fit$coefficients[1]
    (Intercept)
       1419.208
    > fit$coefficients[[1]]
    [1] 1419.208
    > fit$coefficients[2]
      year
    -0.705
    > fit$coefficients[[2]]
    [1] -0.705
    

    Note that if you just want to get the number you should use two square brackets. So if you want to get an estimate of the interest rate in the year 2015 you can use the formula for a line:

    > fit$coefficients[[2]]*2015+fit$coefficients[[1]]
    [1] -1.367
    

    So if you just wait long enough, the banks will pay you to take a car!
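
    A hedged alternative for the same calculation is the predict function, which evaluates the fitted model at new values of the explanatory variable:

    > predict(fit,newdata=data.frame(year=2015))
         1
    -1.367
    >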

    A better use for this formula would be to calculate the residuals and plot them:

    > res <- rate - (fit$coefficients[[2]]*year+fit$coefficients[[1]])
    > res
    [1]  0.132 -0.003 -0.178 -0.163  0.212
    > plot(year,res)
    

    That is a bit messy, but fortunately there are easier ways to get the residuals. Two other ways are shown below:

    > residuals(fit)
         1      2      3      4      5
     0.132 -0.003 -0.178 -0.163  0.212
    > fit$residuals
         1      2      3      4      5
     0.132 -0.003 -0.178 -0.163  0.212
    > plot(year,fit$residuals)
    >
    

    If you want to plot the regression line on the same plot as your scatter plot you can use the abline function along with your variable fit:

    > plot(year,rate,
         main="Commercial Banks Interest Rate for 4 Year Car Loan",
         sub="http://www.federalreserve.gov/releases/g19/20050805/&quo, .toc-backreft;)
    > abline(fit)
    

    Finally, as a teaser for the kinds of analyses you might see later, you can get the results of an F-test by asking R for a summary of the fit variable:

    > summary(fit)
    
    Call:
    lm(formula = rate ~ year)
    
    Residuals:
         1      2      3      4      5
     0.132 -0.003 -0.178 -0.163  0.212
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) 1419.20800  126.94957   11.18  0.00153 **
    year          -0.70500    0.06341  -11.12  0.00156 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.2005 on 3 degrees of freedom
    Multiple R-Squared: 0.9763,     Adjusted R-squared: 0.9684
    F-statistic: 123.6 on 1 and 3 DF,  p-value: 0.001559
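
    As another teaser, the variable returned by lm can be handed directly to the plot command to produce R's standard regression diagnostic plots. A minimal sketch:

    > par(mfrow=c(2,2))
    > plot(fit)
    >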
    

    9. Calculating Confidence Intervals

    Here we look at some examples of calculating confidence intervals. The examples are for both normal and t distributions. We assume that you can enter data and know the commands associated with basic probability. Note that an easier way to calculate confidence intervals using the t.test command is discussed in section The Easy Way.

    9.1. Calculating a Confidence Interval From a Normal Distribution

    Here we will look at a fictitious example. We will make some assumptions for what we might find in an experiment and find the resulting confidence interval using a normal distribution. Here we assume that the sample mean is 5, the standard deviation is 2, and the sample size is 20. In the example below we will use a 95% confidence level and wish to find the confidence interval. The commands to find the confidence interval in R are the following:

    > a <- 5
    > s <- 2
    > n <- 20
    > error <- qnorm(0.975)*s/sqrt(n)
    > left <- a-error
    > right <- a+error
    > left
    [1] 4.123477
    > right
    [1] 5.876523
    

    We are 95% confident that the true mean is between 4.12 and 5.88, assuming that the original random variable is normally distributed and the samples are independent.

    9.2. Calculating a Confidence Interval From a t Distribution

    Calculating the confidence interval when using a t-test is similar to using a normal distribution. The only difference is that we use the command associated with the t-distribution rather than the normal distribution. Here we repeat the procedures above, but we will assume that we are working with a sample standard deviation rather than an exact standard deviation.

    Again we assume that the sample mean is 5, the sample standard deviation is 2, and the sample size is 20. We use a 95% confidence level and wish to find the confidence interval. The commands to find the confidence interval in R are the following:

    > a <- 5
    > s <- 2
    > n <- 20
    > error <- qt(0.975,df=n-1)*s/sqrt(n)
    > left <- a-error
    > right <- a+error
    > left
    [1] 4.063971
    > right
    [1] 5.936029
    

    We are 95% confident that the true mean is between 4.06 and 5.94, assuming that the original random variable is normally distributed and the samples are independent.

    We now look at an example where we have a univariate data set and want to find the 95% confidence interval for the mean. In this example we use one of the data sets given in the data input chapter. We use the w1.dat data set:

    > w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
    > summary(w1)
         vals
    Min.   :0.130
    1st Qu.:0.480
    Median :0.720
    Mean   :0.765
    3rd Qu.:1.008
    Max.   :1.760
    > length(w1$vals)
    [1] 54
    > mean(w1$vals)
    [1] 0.765
    > sd(w1$vals)
    [1] 0.3781222
    

    We can now calculate the margin of error for the mean:

    > error <- qt(0.975,df=length(w1$vals)-1)*sd(w1$vals)/sqrt(length(w1$vals))
    > error
    [1] 0.1032075
    

    The confidence interval is found by adding and subtracting the error from the mean:

    > left <- mean(w1$vals)-error
    > right <- mean(w1$vals)+error
    > left
    [1] 0.6617925
    > right
    [1] 0.8682075
    

    We are 95% confident that the true mean is between 0.66 and 0.87, assuming that the original random variable is normally distributed and the samples are independent.

    9.3. Calculating Many Confidence Intervals From a t Distribution

    Suppose that you want to find the confidence intervals for many tests. This is a common task and most software packages will allow you to do this.

    We have three different sets of results:

    Comparison 1
                 Mean   Std. Dev.   Number (pop.)
      Group I    10     3           300
      Group II   10.5   2.5         230

    Comparison 2
                 Mean   Std. Dev.   Number (pop.)
      Group I    12     4           210
      Group II   13     5.3         340

    Comparison 3
                 Mean   Std. Dev.   Number (pop.)
      Group I    30     4.5         420
      Group II   28.5   3           400

    For each of these comparisons we want to calculate the associated confidence interval for the difference of the means. For each comparison there are two groups. We will refer to group one as the group whose results are in the first row of each comparison above. We will refer to group two as the group whose results are in the second row of each comparison above. Before we can do that we must first compute a standard error and a margin of error. We will find general formulae, which are necessary in order to do all three calculations at once.

    We assume that the means for the first group are defined in a variable called m1. The means for the second group are defined in a variable called m2. The standard deviations for the first group are in a variable called sd1. The standard deviations for the second group are in a variable called sd2. The number of samples for the first group are in a variable called num1. Finally, the number of samples for the second group are in a variable called num2.

    With these definitions the standard error is the square root of (sd1^2)/num1+(sd2^2)/num2. The R commands to do this can be found below:

    > m1 <- c(10,12,30)
    > m2 <- c(10.5,13,28.5)
    > sd1 <- c(3,4,4.5)
    > sd2 <- c(2.5,5.3,3)
    > num1 <- c(300,210,420)
    > num2 <- c(230,340,400)
    > se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
    > error <- qt(0.975,df=pmin(num1,num2)-1)*se
    

    To see the values just type in the variable name on a line alone:

    > m1
    [1] 10 12 30
    > m2
    [1] 10.5 13.0 28.5
    > sd1
    [1] 3.0 4.0 4.5
    > sd2
    [1] 2.5 5.3 3.0
    > num1
    [1] 300 210 420
    > num2
    [1] 230 340 400
    > se
    [1] 0.2391107 0.3985074 0.2659216
    > error
    [1] 0.4711382 0.7856092 0.5227825
    

    Now we need to define the confidence interval around the assumed differences. Just as when finding the p values (see the chapter on calculating p values), we have to use the pmin command to get the number of degrees of freedom. In this case the null hypotheses are for a difference of zero, and we use a 95% confidence interval:

    > left <- (m1-m2)-error
    > right <- (m1-m2)+error
    > left
    [1] -0.9711382 -1.7856092  0.9772175
    > right
    [1] -0.02886177 -0.21439076  2.02278249
    

    This gives the confidence intervals for each of the three tests. For example, in the first experiment the 95% confidence interval is between -0.97 and -0.03 assuming that the random variables are normally distributed, and the samples are independent.

    10. Calculating p Values

    Here we look at some examples of calculating p values. The examples are for both normal and t distributions. We assume that you can enter data and know the commands associated with basic probability. We first show how to do the calculations the hard way; the last method makes use of the t.test command and demonstrates an easier way to calculate a p value.

    10.1. Calculating a Single p Value From a Normal Distribution

    We look at the steps necessary to calculate the p value for a particular test. In the interest of simplicity we only look at a two sided test, and we focus on one example. Here we want to test whether or not the mean is equal to a fixed value, a:

    \[H_o: \mu_x = a, \qquad H_a: \mu_x \neq a.\]

    The p value is calculated for a particular sample mean. Here we assume that we obtained a sample mean, x, and want to find its p value. It is the probability of obtaining a sample mean whose Z-score is greater than the absolute value of our Z-score or less than the negative of that absolute value.

    For the special case of a normal distribution we also need the standard deviation. We will assume that we are given the standard deviation and call it s. The calculation for the p value can be done in several ways. We will look at two ways here. The first way is to convert the sample mean to its associated Z-score. The other way is to simply specify the standard deviation and let the computer do the conversion. At first glance it may seem like a no brainer, and we should just use the second method. Unfortunately, when using the t-distribution we need to convert to the t-score, so it is a good idea to know both ways.

    We first look at how to calculate the p value using the Z-score. The Z-score is found by assuming that the null hypothesis is true, subtracting the assumed mean, and dividing by the theoretical standard deviation. Once the Z-score is found the probability that the value could be less than the Z-score is found using the pnorm command.

    This is not enough to get the p value. If the Z-score that is found is positive then we need to take one minus the associated probability. Also, for a two sided test we need to multiply the result by two. Here we avoid these issues and ensure that the Z-score is negative by taking the negative of the absolute value.

    We now look at a specific example. In the example below we will use a value of a of 5, a standard deviation of 2, and a sample size of 20. We then find the p value for a sample mean of 7:

    > a <- 5
    > s <- 2
    > n <- 20
    > xbar <- 7
    > z <- (xbar-a)/(s/sqrt(n))
    > z
    [1] 4.472136
    > 2*pnorm(-abs(z))
    [1] 7.744216e-06
    

    We now look at the same problem only specifying the mean and standard deviation within the pnorm command. Note that for this case we cannot so easily force the use of the left tail. Since the sample mean is more than the assumed mean we have to take two times one minus the probability:

    > a <- 5
    > s <- 2
    > n <- 20
    > xbar <- 7
    > 2*(1-pnorm(xbar,mean=a,sd=s/sqrt(20)))
    [1] 7.744216e-06
    

    10.2. Calculating a Single p Value From a t Distribution

    Finding the p value using a t distribution is very similar to using the Z-score as demonstrated above. The only difference is that you have to specify the number of degrees of freedom. Here we look at the same example as above but use the t distribution instead:

    > a <- 5
    > s <- 2
    > n <- 20
    > xbar <- 7
    > t <- (xbar-a)/(s/sqrt(n))
    > t
    [1] 4.472136
    > 2*pt(-abs(t),df=n-1)
    [1] 0.0002611934
    

    We now look at an example where we have a univariate data set and want to find the p value. In this example we use one of the data sets given in the data input chapter. We use the w1.dat data set:

    > w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
    > summary(w1)
          vals
     Min.   :0.130
     1st Qu.:0.480
     Median :0.720
     Mean   :0.765
     3rd Qu.:1.008
     Max.   :1.760
    > length(w1$vals)
    [1] 54
    

    Here we use a two sided hypothesis test,

    \[H_o: \mu_x = 0.7, \qquad H_a: \mu_x \neq 0.7.\]

    So we calculate the sample mean and sample standard deviation in order to calculate the p value:

    > t <- (mean(w1$vals)-0.7)/(sd(w1$vals)/sqrt(length(w1$vals)))
    > t
    [1] 1.263217
    > 2*pt(-abs(t),df=length(w1$vals)-1)
    [1] 0.21204
    

    10.3. Calculating Many p Values From a t Distribution

    Suppose that you want to find the p values for many tests. This is a common task and most software packages will allow you to do this. Here we see how it can be done in R.

    Here we assume that we want to do a one-sided hypothesis test for a number of comparisons. In particular we will look at three hypothesis tests. All are of the following form:

    \[H_o: \mu_1 - \mu_2 = 0, \qquad H_a: \mu_1 - \mu_2 \neq 0.\]

    We have three different sets of comparisons to make:

    Comparison 1
                 Mean   Std. Dev.   Number (pop.)
      Group I    10     3           300
      Group II   10.5   2.5         230

    Comparison 2
                 Mean   Std. Dev.   Number (pop.)
      Group I    12     4           210
      Group II   13     5.3         340

    Comparison 3
                 Mean   Std. Dev.   Number (pop.)
      Group I    30     4.5         420
      Group II   28.5   3           400

    For each of these comparisons we want to calculate a p value. For each comparison there are two groups. We will refer to group one as the group whose results are in the first row of each comparison above. We will refer to group two as the group whose results are in the second row of each comparison above. Before we can do that we must first compute a standard error and a t-score. We will find general formulae, which are necessary in order to do all three calculations at once.

    We assume that the means for the first group are defined in a variable called m1. The means for the second group are defined in a variable called m2. The standard deviations for the first group are in a variable called sd1. The standard deviations for the second group are in a variable called sd2. The number of samples for the first group are in a variable called num1. Finally, the number of samples for the second group are in a variable called num2.

    With these definitions the standard error is the square root of (sd1^2)/num1+(sd2^2)/num2. The associated t-score is m1 minus m2 all divided by the standard error. The R commands to do this can be found below:

    > m1 <- c(10,12,30)
    > m2 <- c(10.5,13,28.5)
    > sd1 <- c(3,4,4.5)
    > sd2 <- c(2.5,5.3,3)
    > num1 <- c(300,210,420)
    > num2 <- c(230,340,400)
    > se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
    > t <- (m1-m2)/se
    

    To see the values just type in the variable name on a line alone:

    > m1
    [1] 10 12 30
    > m2
    [1] 10.5 13.0 28.5
    > sd1
    [1] 3.0 4.0 4.5
    > sd2
    [1] 2.5 5.3 3.0
    > num1
    [1] 300 210 420
    > num2
    [1] 230 340 400
    > se
    [1] 0.2391107 0.3985074 0.2659216
    > t
    [1] -2.091082 -2.509364  5.640761
    

    To use the pt command we need to specify the number of degrees of freedom. This can be done using the pmin command. Note that there is also a command called min, but it does not work the same way. You need to use pmin to get the correct results. The numbers of degrees of freedom are pmin(num1,num2)-1. So the p values can be found using the following R command:

    > pt(t,df=pmin(num1,num2)-1)
    [1] 0.01881168 0.00642689 0.99999998
    

    If you enter all of these commands into R you should have noticed that the last p value is not correct. The pt command gives the probability that a score is less than the specified t. The t-score for the last entry is positive, and we want the probability that a t-score is bigger. One way around this is to make sure that all of the t-scores are negative. You can do this by taking the negative of the absolute value of the t-scores:

    > pt(-abs(t),df=pmin(num1,num2)-1)
    [1] 1.881168e-02 6.426890e-03 1.605968e-08
    

    The results from the command above should give you the p values for a one-sided test. It is left as an exercise how to find the p values for a two-sided test.
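
    Since getting the degrees of freedom right depends on the difference between min and pmin, here is a quick check using the sample size vectors defined above; min returns a single overall minimum while pmin works element by element:

    > min(num1,num2)
    [1] 210
    > pmin(num1,num2)
    [1] 230 210 400
    >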

    10.4. The Easy Way

    The methods above demonstrate how to calculate the p values directly making use of the standard formulae. There is another, more direct way to do this using the t.test command. The t.test command takes a data set for an argument, and the default operation is to perform a two sided hypothesis test.

    > x = c(9.0,9.5,9.6,10.2,11.6)
    > t.test(x)
    
                                    One Sample t-test
    
    data:  x
    t = 22.2937, df = 4, p-value = 2.397e-05
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
            8.737095 11.222905
    sample estimates:
    mean of x
                           9.98
    
    > help(t.test)
    >
    

    That was an obvious result. If you want to test against a different assumed mean then you can use the mu argument:

    > x = c(9.0,9.5,9.6,10.2,11.6)
    > t.test(x,mu=10)
    
                                    One Sample t-test
    
    data:  x
    t = -0.0447, df = 4, p-value = 0.9665
    alternative hypothesis: true mean is not equal to 10
    95 percent confidence interval:
            8.737095 11.222905
    sample estimates:
    mean of x
                           9.98
    

    If you are interested in a one sided test then you can specify which test to employ using the alternative option:

    > x = c(9.0,9.5,9.6,10.2,11.6)
    > t.test(x,mu=10,alternative="less")
    
                                    One Sample t-test
    
    data:  x
    t = -0.0447, df = 4, p-value = 0.4833
    alternative hypothesis: true mean is less than 10
    95 percent confidence interval:
                           -Inf 10.93434
    sample estimates:
    mean of x
                           9.98
    

    The t.test() command also accepts a second data set to compare two sets of samples. The default is to treat them as independent sets, but there is an option to treat them as dependent data sets. (Enter help(t.test) for more information.) To test two different samples, the first two arguments should be the data sets to compare:

    > x = c(9.0,9.5,9.6,10.2,11.6)
    > y=c(9.9,8.7,9.8,10.5,8.9,8.3,9.8,9.0)
    > t.test(x,y)
            Welch Two Sample t-test
    
    data:  x and y
    t = 1.1891, df = 6.78, p-value = 0.2744
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
           -0.6185513  1.8535513
    sample estimates:
    mean of x mean of y
                   9.9800    9.3625
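
    The dependent (paired) option mentioned above can be requested with the paired argument. A minimal sketch with made-up, equal-length samples; in a paired test each entry of the first sample is matched with the corresponding entry of the second, and the output has the same layout as the Welch test above:

    > before <- c(9.0,9.5,9.6,10.2,11.6)
    > after <- c(9.9,8.7,9.8,10.5,8.9)
    > t.test(before,after,paired=TRUE)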
    

    11. Calculating The Power Of A Test

    Here we look at some examples of calculating the power of a test. The examples are for both normal and t distributions. We assume that you can enter data and know the commands associated with basic probability. All of the examples here are for a two sided test, and you can adjust them accordingly for a one sided test.

    11.1. Calculating The Power Using a Normal Distribution

    Here we calculate the power of a test for a normal distribution for a specific example. Suppose that our hypothesis test is the following:

    \[H_o: \mu_x = a, \qquad H_a: \mu_x \neq a.\]

    The power of a test is the probability that we can reject the null hypothesis when the true mean differs from the one specified in the null hypothesis by a given amount. We calculate this probability by first calculating the probability that we fail to reject the null hypothesis when we should. This is the probability of making a type II error. The power is the probability that we do not make a type II error, so we then take one minus the result to get the power.

    We can fail to reject the null hypothesis if the sample happens to be within the confidence interval we find when we assume that the null hypothesis is true. To get the confidence interval we find the margin of error and then add and subtract it to the proposed mean, a, to get the confidence interval. We then turn around and assume instead that the true mean is at a different, explicitly specified level, and then find the probability a sample could be found within the original confidence interval.

    In the example below the hypothesis test is for

    \[H_o: \mu_x = 5, \qquad H_a: \mu_x \neq 5.\]

    We will assume that the standard deviation is 2, and the sample size is 20. In the example below we will use a 95% confidence level and wish to find the power to detect a true mean that differs from 5 by an amount of 1.5. (All of these numbers are made up solely for this example.) The commands to find the confidence interval in R are the following:

    > a <- 5
    > s <- 2
    > n <- 20
    > error <- qnorm(0.975)*s/sqrt(n)
    > left <- a-error
    > right <- a+error
    > left
    [1] 4.123477
    > right
    [1] 5.876523
    

    Next we find the Z-scores for the left and right values assuming that the true mean is 5+1.5=6.5:

    > assumed <- a + 1.5
    > Zleft <- (left-assumed)/(s/sqrt(n))
    > Zright <-(right-assumed)/(s/sqrt(n))
    > p <- pnorm(Zright)-pnorm(Zleft)
    > p
    [1] 0.08163792
    

    The probability that we make a type II error if the true mean is 6.5 is approximately 8.2%. So the power of the test is 1-p:

    > 1-p
    [1] 0.918362
    

    In this example, the power of the test is approximately 91.8%. If the true mean differs from 5 by 1.5 then the probability that we will reject the null hypothesis is approximately 91.8%.

    11.2. Calculating The Power Using a t Distribution

    Calculating the power when using a t-test is similar to using a normal distribution. One difference is that we use the command associated with the t-distribution rather than the normal distribution. Here we repeat the test above, but we will assume that we are working with a sample standard deviation rather than an exact standard deviation. We will explore three different ways to calculate the power of a test. The first method makes use of the scheme many books recommend if you do not have the non-central distribution available. The second does make use of the non-central distribution, and the third makes use of a single command that will do a lot of the work for us.

    In the example the hypothesis test is the same as above,

    \[H_o: \mu_x = 5, \qquad H_a: \mu_x \neq 5.\]

    Again we assume that the sample standard deviation is 2, and the sample size is 20. We use a 95% confidence level and wish to find the power to detect a true mean that differs from 5 by an amount of 1.5. The commands to find the confidence interval in R are the following:

    > a <- 5
    > s <- 2
    > n <- 20
    > error <- qt(0.975,df=n-1)*s/sqrt(n)
    > left <- a-error
    > right <- a+error
    > left
    [1] 4.063971
    > right
    [1] 5.936029
    

    The number of observations is large enough that the results are quite close to those in the example using the normal distribution. Next we find the t-scores for the left and right values assuming that the true mean is 5+1.5=6.5:

    > assumed <- a + 1.5
    > tleft <- (left-assumed)/(s/sqrt(n))
    > tright <- (right-assumed)/(s/sqrt(n))
    > p <- pt(tright,df=n-1)-pt(tleft,df=n-1)
    > p
    [1] 0.1112583
    

    The probability that we make a type II error if the true mean is 6.5 is approximately 11.1%. So the power of the test is 1-p:

    > 1-p
    [1] 0.8887417
    

    In this example, the power of the test is approximately 88.9%. If the true mean differs from 5 by 1.5 then the probability that we will reject the null hypothesis is approximately 88.9%. Note that the power calculated for a normal distribution is slightly higher than for this one calculated with the t-distribution.

    Another way to approximate the power is to make use of the non-centrality parameter. The idea is that you give it the critical t scores and the amount that the mean would be shifted if the alternate mean were the true mean. This is the method that most books recommend.

    > ncp <- 1.5/(s/sqrt(n))
    > t <- qt(0.975,df=n-1)
    > pt(t,df=n-1,ncp=ncp)-pt(-t,df=n-1,ncp=ncp)
    [1] 0.1111522
    > 1-(pt(t,df=n-1,ncp=ncp)-pt(-t,df=n-1,ncp=ncp))
    [1] 0.8888478
    

    Again, we see that the probability of making a type II error is approximately 11.1%, and the power is approximately 88.9%. Note that this is slightly different than the previous calculation but is still close.

    Finally, there is one more command that we explore. This command allows us to do the same power calculation as above but with a single command.

    > power.t.test(n=n,delta=1.5,sd=s,sig.level=0.05,
            type="one.sample",alternative="two.sided",strict = TRUE)
    
         One-sample t test power calculation
    
                  n = 20
              delta = 1.5
                 sd = 2
          sig.level = 0.05
              power = 0.8888478
        alternative = two.sided
    

    This is a powerful command that can do much more than just calculate the power of a test. For example it can also be used to calculate the number of observations necessary to achieve a given power. For more information check out the help page, help(power.t.test).
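
    For example, a minimal sketch of solving for the sample size instead: leave n out of the call and specify the desired power, and the printed summary then includes the n required (the target of 0.90 here is just an illustrative choice):

    > power.t.test(delta=1.5,sd=s,sig.level=0.05,power=0.90,
            type="one.sample",alternative="two.sided")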

    11.3. Calculating Many Powers From a t Distribution

    Suppose that you want to find the powers for many tests. This is a common task and most software packages will allow you to do this. Here we see how it can be done in R. We use the exact same cases as in the previous chapter.

    Here we assume that we want to do a two-sided hypothesis test for a number of comparisons and want to find the power of the tests to detect a 1 point difference in the means. In particular we will look at three hypothesis tests. All are of the following form:

    \[H_o: \mu_1 - \mu_2 = 0, \qquad H_a: \mu_1 - \mu_2 \neq 0.\]

    We have three different sets of comparisons to make:

    Comparison 1
                 Mean   Std. Dev.   Number (pop.)
      Group I    10     3           300
      Group II   10.5   2.5         230

    Comparison 2
                 Mean   Std. Dev.   Number (pop.)
      Group I    12     4           210
      Group II   13     5.3         340

    Comparison 3
                 Mean   Std. Dev.   Number (pop.)
      Group I    30     4.5         420
      Group II   28.5   3           400

    For each of these comparisons we want to calculate the power of the test. For each comparison there are two groups. We will refer to group one as the group whose results are in the first row of each comparison above. We will refer to group two as the group whose results are in the second row of each comparison above. Before we can do that we must first compute a standard error and a t-score. We will find general formulae, which are necessary in order to do all three calculations at once.

    We assume that the means for the first group are defined in a variable called m1. The means for the second group are defined in a variable called m2. The standard deviations for the first group are in a variable called sd1. The standard deviations for the second group are in a variable called sd2. The number of samples for the first group are in a variable called num1. Finally, the number of samples for the second group are in a variable called num2.

    With these definitions the standard error is the square root of (sd1^2)/num1+(sd2^2)/num2. The R commands to do this can be found below:

    > m1 <- c(10,12,30)
    > m2 <- c(10.5,13,28.5)
    > sd1 <- c(3,4,4.5)
    > sd2 <- c(2.5,5.3,3)
    > num1 <- c(300,210,420)
    > num2 <- c(230,340,400)
    > se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
    

    To see the values just type in the variable name on a line alone:

    > m1
    [1] 10 12 30
    > m2
    [1] 10.5 13.0 28.5
    > sd1
    [1] 3.0 4.0 4.5
    > sd2
    [1] 2.5 5.3 3.0
    > num1
    [1] 300 210 420
    > num2
    [1] 230 340 400
    > se
    [1] 0.2391107 0.3985074 0.2659216
    

    Now we need to define the confidence interval around the assumed differences. Just as in the case of finding the p values in the previous chapter, we have to use the pmin command to get the number of degrees of freedom. In this case the null hypotheses are for a difference of zero, and we use a 95% confidence interval:

    > left <- qt(0.025,df=pmin(num1,num2)-1)*se
    > right <- -left
    > left
    [1] -0.4711382 -0.7856092 -0.5227825
    > right
    [1] 0.4711382 0.7856092 0.5227825
    

    We can now calculate the power of the two sided test. Assuming a true difference in the means of 1 we can calculate the t-scores associated with both the left and right variables:

    > tl <- (left-1)/se
    > tr <- (right-1)/se
    > tl
    [1] -6.152541 -4.480743 -5.726434
    > tr
    [1] -2.2117865 -0.5379844 -1.7945799
    > probII <- pt(tr,df=pmin(num1,num2)-1) -
    pt(tl,df=pmin(num1,num2)-1)
    > probII
    [1] 0.01398479 0.29557399 0.03673874
    > power <- 1-probII
    > power
    [1] 0.9860152 0.7044260 0.9632613
    

    The results from the commands above give the powers for a two-sided test. It is left as an exercise how to find the powers for a one-sided test.

    Just as was found above there is more than one way to calculate the power. We also include the method using the non-central parameter which is recommended over the previous method:

    > t <- qt(0.975,df=pmin(num1,num2)-1)
    > t
    [1] 1.970377 1.971379 1.965927
    > ncp <- (1)/se
    > pt(t,df=pmin(num1,num2)-1,ncp=ncp)-pt(-t,df=pmin(num1,num2)-1,ncp=ncp)
    [1] 0.01374112 0.29533455 0.03660842
    > 1-(pt(t,df=pmin(num1,num2)-1,ncp=ncp)-pt(-t,df=pmin(num1,num2)-1,ncp=ncp))
    [1] 0.9862589 0.7046655 0.9633916
    

    12. Two Way Tables

    Contents

    Here we look at some examples of how to work with two way tables. We assume that you can enter data and understand the different data types.

    12.1. Creating a Table from Data

    We first look at how to create a table from raw data. Here we use a fictitious data set, smoker.csv. This data set was created only to be used as an example, and the numbers were created to match an example from a text book, p. 629 of the 4th edition of Moore and McCabe’s Introduction to the Practice of Statistics. You should look at the data set in a spreadsheet to see how it is entered. The information is ordered in a way to make it easier to figure out what information is in the data.

    The idea is that 356 people have been polled on their smoking status (Smoke) and their socioeconomic status (SES). For each person it was determined whether they are current smokers, former smokers, or have never smoked. Also, for each person their socioeconomic status was determined (low, middle, or high). The data file contains only two columns, and when it is read in R interprets them both as factors:

    > smokerData <- read.csv(file='smoker.csv',sep=',',header=T)
    > summary(smokerData)
         Smoke         SES
     current:116   High  :211
     former :141   Low   : 93
     never  : 99   Middle: 52
    

    You can create a two way table of occurrences using the table command and the two columns in the data frame:

    > smoke <- table(smokerData$Smoke,smokerData$SES)
    > smoke
    
              High Low Middle
      current   51  43     22
      former    92  28     21
      never     68  22      9
    

    In this example, there are 51 people who are current smokers and are in the high SES. Note that it is assumed that the two lists given in the table command are both factors. (More information on this is available in the chapter on data types.)
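    As an aside, if you would also like the row and column totals shown with the table, the addmargins function (part of the stats package that comes with R) can be applied to it. This is just a side note and is not used in the rest of the chapter:

    > addmargins(smoke)      # appends a "Sum" row and column to the table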

    12.2. Creating a Table Directly

    Sometimes you are given data already summarized in the form of a table and would like to enter it into R directly. Here we examine how to create the table directly. Unfortunately, this is not as direct a method as might be desired. The approach is to create an array of numbers, specify the row and column names, and then convert it to a table.

    In the example below we will create a table identical to the one given above. In that example we have 3 columns, and the numbers are specified by going across each row from top to bottom. We need to specify the data, the number of columns, and that the values are entered row by row:

    > smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
    > colnames(smoke) <- c("High","Low","Middle")
    > rownames(smoke) <- c("current","former","never")
    > smoke <- as.table(smoke)
    > smoke
            High Low Middle
    current   51  43     22
    former    92  28     21
    never     68  22      9
    

    12.3. Tools For Working With Tables

    Here we look at some of the commands available to help look at the information in a table in different ways. We assume that the data has been entered using one of the methods above and that the table is called “smoke.” First, there are a couple of ways to get graphical views of the data:

    > barplot(smoke,legend=T,beside=T,main='Smoking Status by SES')
    > plot(smoke,main="Smoking Status By Socioeconomic Status")
    

    There are a number of ways to get the marginal distributions using the margin.table command. If you just give the command the table it calculates the total number of observations. You can also calculate the marginal distributions across the rows or columns based on the one optional argument:

    > margin.table(smoke)
    [1] 356
    > margin.table(smoke,1)
    
    current  former   never
        116     141      99
    > margin.table(smoke,2)
    
      High    Low Middle
       211     93     52
    

    Combining these commands you can get the proportions:

    > smoke/margin.table(smoke)
    
                    High        Low     Middle
      current 0.14325843 0.12078652 0.06179775
      former  0.25842697 0.07865169 0.05898876
      never   0.19101124 0.06179775 0.02528090
    > margin.table(smoke,1)/margin.table(smoke)
    
      current    former     never
    0.3258427 0.3960674 0.2780899
    > margin.table(smoke,2)/margin.table(smoke)
    
         High       Low    Middle
    0.5926966 0.2612360 0.1460674
    

    That is a little obtuse, so fortunately, there is a better way to get the proportions using the prop.table command. You can specify the proportions with respect to the different marginal distributions using the optional argument:

    > prop.table(smoke)
    
                    High        Low     Middle
      current 0.14325843 0.12078652 0.06179775
      former  0.25842697 0.07865169 0.05898876
      never   0.19101124 0.06179775 0.02528090
    > prop.table(smoke,1)
    
                   High       Low    Middle
      current 0.4396552 0.3706897 0.1896552
      former  0.6524823 0.1985816 0.1489362
      never   0.6868687 0.2222222 0.0909091
    > prop.table(smoke,2)
    
                   High       Low    Middle
      current 0.2417062 0.4623656 0.4230769
      former  0.4360190 0.3010753 0.4038462
      never   0.3222749 0.2365591 0.1730769
    

    If you want to do a chi-squared test to determine if the proportions are different, there is an easy way to do this. If we want to test at the 95% confidence level we need only look at a summary of the table:

    > summary(smoke)
    Number of cases in table: 356
    Number of factors: 2
    Test for independence of all factors:
            Chisq = 18.51, df = 4, p-value = 0.0009808
    

    Since the p-value is less than 5% we can reject the null hypothesis at the 95% confidence level and can say that the proportions vary.
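    The same test can also be run directly on the table with the chisq.test function. It reports the same statistic and p-value as the summary above:

    > chisq.test(smoke)      # reports X-squared = 18.51, df = 4, p-value = 0.0009808, matching summary(smoke)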

    Of course, there is a hard way to do this. This is not for the faint of heart and involves some linear algebra which we will not describe. If you wish to calculate the table of expected values then you need to multiply the vectors of the margins and divide by the total number of observations:

    > expected <- as.array(margin.table(smoke,1)) %*% t(as.array(margin.table(smoke,2))) / margin.table(smoke)
    > expected
    
                  High      Low   Middle
      current 68.75281 30.30337 16.94382
      former  83.57022 36.83427 20.59551
      never   58.67697 25.86236 14.46067
    

    (The “t” function takes the transpose of the array.)

    The result is an array that can be directly compared to the existing table. We need the square of the difference between the two tables divided by the expected values. The sum of all these values is the Chi-squared statistic:

    > chi <- sum((expected - as.array(smoke))^2/expected)
    > chi
    [1] 18.50974
    

    We can then get the p-value for this statistic:

    > 1-pchisq(chi,df=4)
    [1] 0.0009808236
    

    12.4. Graphical Views of Tables

    The plot command will automatically produce a mosaic plot if its primary argument is a table. Alternatively, you can call the mosaicplot command directly.

    > smokerData <- read.csv(file='smoker.csv',sep=',',header=T)
    > smoke <- table(smokerData$Smoke,smokerData$SES)
    > mosaicplot(smoke)
    > help(mosaicplot)
    >
    

    The mosaicplot command takes many of the same arguments for annotating a plot:

    > mosaicplot(smoke,main="Smokers",xlab="Status",ylab="Economic Class")
    >
    

    If you wish to switch which side (horizontal versus vertical) to determine the primary proportion then you can use the sort option. This can be used to switch whether the width or height is used for the first proportional length:

    > mosaicplot(smoke,main="Smokers",xlab="Status",ylab="Economic Class")
    > mosaicplot(smoke,sort=c(2,1))
    >
    

    Finally, if you wish to switch which side is used for the vertical and horizontal axes you can use the dir option:

    > mosaicplot(smoke,main="Smokers",xlab="Status",ylab="Economic Class")
    > mosaicplot(smoke,dir=c("v","h"))
    >
    

    13. Data Management

    Contents

    Here we look at some common tasks that come up when dealing with data. These tasks range from assembling different data sets into more convenient forms and ways to apply functions to different parts of the data sets. The topics in this section demonstrate some of the power of R, but it may not be clear at first. The functions are commonly used in a wide variety of circumstances for a number of different reasons. These tools have saved me a great deal of time and effort in circumstances that I would not have predicted in advance.

    The important thing to note, though, is that this section is called “Data Management.” It is not called “Data Manipulation.” Politicians “manipulate” data, we “manage” them.

    13.1. Appending Data

    When you have more than one set of data you may want to bring them together. You can bring different data sets together by appending as rows (rbind) or by appending as columns (cbind). The first example shows how this is done with two data frames. The arguments to the functions can take any number of objects. We only use two here to keep the demonstration simpler, but additional data frames can be appended in the same call. It is important to note that when you bring things together as rows the names of the objects within the data frames must be the same.

    > a <- data.frame(one=c( 0, 1, 2),two=c("a","a","b"))
    > b <- data.frame(one=c(10,11,12),two=c("c","c","d"))
    > a
      one two
    1   0   a
    2   1   a
    3   2   b
    > b
      one two
    1  10   c
    2  11   c
    3  12   d
    > v <- rbind(a,b)
    > typeof(v)
    [1] "list"
    > v
      one two
    1   0   a
    2   1   a
    3   2   b
    4  10   c
    5  11   c
    6  12   d
    > w <- cbind(a,b)
    > typeof(w)
    [1] "list"
    > w
      one two one two
    1   0   a  10   c
    2   1   a  11   c
    3   2   b  12   d
    > names(w) = c("one","two","three","four")
    > w
      one two three four
    1   0   a    10    c
    2   1   a    11    c
    3   2   b    12    d
    

    The same commands also work with vectors and matrices and behave in a similar manner.

    > A = matrix(c( 1, 2, 3, 4, 5, 6),ncol=3,byrow=TRUE)
    > A
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6
    > B = matrix(c(10,20,30,40,50,60),ncol=3,byrow=TRUE)
    > B
         [,1] [,2] [,3]
    [1,]   10   20   30
    [2,]   40   50   60
    > V <- rbind(A,B)
    > typeof(V)
    [1] "double"
    > V
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6
    [3,]   10   20   30
    [4,]   40   50   60
    > W <- cbind(A,B)
    > typeof(W)
    [1] "double"
    > W
         [,1] [,2] [,3] [,4] [,5] [,6]
    [1,]    1    2    3   10   20   30
    [2,]    4    5    6   40   50   60
    

    13.2. Applying Functions Across Data Elements

    The various apply functions can be an invaluable tool when trying to work with subsets within a data set. The different versions of the apply commands are used to take a function and have the function perform an operation on each part of the data. There are a wide variety of these commands, but we only look at two sets of them. The first set, lapply and sapply, is used to apply a function to every element in a list. The second one, tapply, is used to apply a function on each set broken up by a given set of factors.

    13.2.1. Operations on Lists and Vectors

    First, the lapply command is used to take a list of items and perform some function on each member of the list. That is, the list includes a number of different objects. You want to perform some operation on every object within the list. You can use lapply to tell R to go through each item in the list and perform the desired action on each item.

    In the following example a list is created with three elements. The first is a randomly generated set of numbers with a normal distribution. The second is a randomly generated set of numbers with an exponential distribution. The last is a set of factors. A summary is then performed on each element in the list.

    > x <- list(a=rnorm(200,mean=1,sd=10),
                b=rexp(300,10.0),
                c=as.factor(c("a","b","b","b","c","c")))
    > lapply(x,summary)
    $a
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
    -26.65000  -6.91200  -0.39250   0.09478   6.86700  32.00000
    
    $b
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
    0.0001497 0.0242300 0.0633300 0.0895400 0.1266000 0.7160000
    
    $c
    a b c
    1 3 2
    

    The lapply command returns a list. The entries in the list have the same names as the entries in the list that is passed to it. The values of each entry are the results from applying the function. The sapply function is similar, but the difference is that it tries to turn the result into a vector or matrix if possible. If it does not make sense then it returns a list just like the lapply command.

    > x <- list(a=rnorm(8,mean=1,sd=10),b=rexp(10,10.0))
    > x
    $a
    [1]  -0.3881426   6.2910959  13.0265859  -1.5296377   6.9285984 -28.3050569
    [7]  11.9119731  -7.6036997
    
    $b
    [1] 0.212689007 0.081818395 0.222462531 0.181424705 0.168476454 0.002924134
    [7] 0.007010114 0.016301837 0.081291728 0.055426055
    
    > val <- lapply(x,mean)
    > typeof(val)
    [1] "list"
    > val
    $a
    [1] 0.04146456
    
    $b
    [1] 0.1029825
    
    > val$a
    [1] 0.04146456
    > val$b
    [1] 0.1029825
    >
    >
    > other <- sapply(x,mean)
    > typeof(other)
    [1] "double"
    > other
             a          b
    0.04146456 0.10298250
    > other[1]
             a
    0.04146456
    > other[2]
             b
    0.1029825
    

    13.2.2. Operations By Factors

    Another widely used variant of the apply functions is the tapply function. The tapply function will take a list of data, usually a vector, a list of factors of the same length, and a function. It will then apply the function to each subset of the data that matches each of the factors.

    > val <- data.frame(a=c(1,2,10,20,5,50),
                        b=as.factor(c("a","a","b","b","a","b")))
    > val
       a b
    1  1 a
    2  2 a
    3 10 b
    4 20 b
    5  5 a
    6 50 b
    > result <- tapply(val$a,val$b,mean)
    > typeof(result)
    [1] "double"
    > result
           a         b
    2.666667 26.666667
    > result[1]
           a
    2.666667
    > result[2]
           b
    26.66667
    > result <- tapply(val$a,val$b,summary)
    > typeof(result)
    [1] "list"
    > result
    $a
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.000   1.500   2.000   2.667   3.500   5.000
    
    $b
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    10.00   15.00   20.00   26.67   35.00   50.00
    
    > result$a
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.000   1.500   2.000   2.667   3.500   5.000
    > result$b
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    10.00   15.00   20.00   26.67   35.00   50.00
    

    14. Time Data Types

    Contents

    The time data types are broken out into a separate section from the introductory section on data types. (Basic Data Types) The reason for this is that dealing with time data can be subtle and must be done carefully because the data type can be cast in a variety of different ways. It is not an introductory topic, and if not done well can scare off the normal people.

    I will first go over the basic time data types and then explore the different kinds of operations that are done with the time data types. Please be cautious with time data and read the complete description including the caveats. There are some common mistakes that result in calculations that yield results that are very different from the intended values.

    14.1. Time and Date Variables

    There are a variety of different types specific to time data fields in R. Here we only look at two, the POSIXct and POSIXlt data types:

    POSIXct

    The POSIXct data type is the number of seconds since the start of January 1, 1970. Negative numbers represent the number of seconds before this time, and positive numbers represent the number of seconds afterwards.
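    A quick way to see this interpretation is to convert a number of seconds directly. This is a minimal sketch with the origin and time zone spelled out explicitly:

    > as.POSIXct(0,origin="1970-01-01",tz="UTC")
    [1] "1970-01-01 UTC"
    > as.POSIXct(86400,origin="1970-01-01",tz="UTC")
    [1] "1970-01-02 UTC"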

    POSIXlt

    The POSIXlt data type is a list, and the entries in the list have the following meanings:

    1. seconds
    2. minutes
    3. hours
    4. day of month (1-31)
    5. month of the year (0-11)
    6. years since 1900
    7. day of the week (0-6 where 0 represents Sunday)
    8. day of the year (0-365)
    9. Daylight savings indicator (positive if it is daylight savings)

    Part of the difficulty with time data types is that R prints them out in a way that is different from how it stores them internally. This can make type conversions tricky, and you have to be careful and test your operations to ensure that R is doing what you think it is doing.

    To get the current time, the Sys.time() command can be used, and you can play around a bit with the basic types to get a feel for what R is doing. The as.POSIXct and as.POSIXlt commands are used to convert the time value into the different formats.

    > help(DateTimeClasses)
    > t <- Sys.time()
    > typeof(t)
    [1] "double"
    > t
    [1] "2014-01-23 14:28:21 EST"
    > print(t)
    [1] "2014-01-23 14:28:21 EST"
    > cat(t,"\n")
    1390505301
    > c <- as.POSIXct(t)
    > typeof(c)
    [1] "double"
    > print(c)
    [1] "2014-01-23 14:28:21 EST"
    > cat(c,"\n")
    1390505301
    >
    >
    > l <- as.POSIXlt(t)
    > l
    [1] "2014-01-23 14:28:21 EST"
    > typeof(l)
    [1] "list"
    > print(l)
    [1] "2014-01-23 14:28:21 EST"
    > cat(l,"\n")
    Error in cat(list(...), file, sep, fill, labels, append) :
    argument 1 (type 'list') cannot be handled by 'cat'
    > names(l)
    NULL
    > l[[1]]
    [1] 21.01023
    > l[[2]]
    [1] 28
    > l[[3]]
    [1] 14
    > l[[4]]
    [1] 23
    > l[[5]]
    [1] 0
    > l[[6]]
    [1] 114
    > l[[7]]
    [1] 4
    > l[[8]]
    [1] 22
    > l[[9]]
    [1] 0
    >
    > b <- as.POSIXct(l)
    > cat(b,"\n")
    1390505301
    

    There are times when you have a time data type and want to convert it into a string so it can be saved into a file to be read by another application. The strftime command is used to take a time data type and convert it to a string. You must supply an additional format string to let R know what format you want to use. See the help page on strftime to get detailed information about the format string.

    > help(strftime)
    >
    > t <- Sys.time()
    > cat(t,"\n")
    1390506463
    > timeStamp <-  strftime(t,"%Y-%m-%d %H:%M:%S")
    > timeStamp
    [1] "2014-01-23 14:47:43"
    > typeof(timeStamp)
    [1] "character"
    

    Commonly a time stamp is saved in a data file, and it must be converted into a time data type to allow for calculations. For example, you may be interested in how much time has elapsed between two observations. The strptime command is used to take a string and convert it into a time data type. Like strftime it requires a format string in addition to the time stamp.

    The strptime command is used to take a string and convert it into a form that R can use for calculations. In the following example a data frame is defined that has the dates stored as strings. If you read the data in from a csv file this is how R will keep track of the data. Note that in this context the strings are treated as categorical data, and R will assume that the data field is a set of factors. You have to use the strptime command to convert it into a time field.

    > myData <- data.frame(time=c("2014-01-23 14:28:21","2014-01-23 14:28:55",
                                  "2014-01-23 14:29:02","2014-01-23 14:31:18"),
                          speed=c(2.0,2.2,3.4,5.5))
    > myData
                     time speed
    1 2014-01-23 14:28:21   2.0
    2 2014-01-23 14:28:55   2.2
    3 2014-01-23 14:29:02   3.4
    4 2014-01-23 14:31:18   5.5
    > summary(myData)
                     time       speed
    2014-01-23 14:28:21:1   Min.   :2.000
    2014-01-23 14:28:55:1   1st Qu.:2.150
    2014-01-23 14:29:02:1   Median :2.800
    2014-01-23 14:31:18:1   Mean   :3.275
                            3rd Qu.:3.925
                            Max.   :5.500
    > myData$time[1]
    [1] 2014-01-23 14:28:21
    4 Levels: 2014-01-23 14:28:21 2014-01-23 14:28:55 ... 2014-01-23 14:31:18
    > typeof(myData$time[1])
    [1] "integer"
    >
    >
    > myData$time <- strptime(myData$time,"%Y-%m-%d %H:%M:%S")
    > myData
                     time speed
    1 2014-01-23 14:28:21   2.0
    2 2014-01-23 14:28:55   2.2
    3 2014-01-23 14:29:02   3.4
    4 2014-01-23 14:31:18   5.5
    > myData$time[1]
    [1] "2014-01-23 14:28:21"
    > typeof(myData$time[1])
    [1] "list"
    > myData$time[1][[2]]
    [1] 28
    

    Now you can perform operations on the fields. For example you can determine the time between observations. (Please see the notes below on time operations. This example is a bit misleading!)

    > N = length(myData$time)
    > myData$time[2:N] - myData$time[1:(N-1)]
    Time differences in secs
    [1]  34   7 136
    attr(,"tzone")
    [1] ""
    

    In addition to the time data types R also has a date data type. The difference is that the date data type keeps track of numbers of days rather than seconds. You can cast a string into a date type using the as.Date function. The as.Date function takes the same arguments as the time data types discussed above.

    > theDates <- c("1 jan 2012","1 jan 2013","1 jan 2014")
    > dateFields <- as.Date(theDates,"%d %b %Y")
    > typeof(dateFields)
    [1] "double"
    > dateFields
    [1] "2012-01-01" "2013-01-01" "2014-01-01"
    > N <- length(dateFields)
    > diff <- dateFields[1:(N-1)] - dateFields[2:N]
    > diff
    Time differences in days
    [1] -366 -365
    

    You can also define a date in terms of the number days after another date using the origin option.

    > infamy <- as.Date(-179,origin="1942-06-04")
    > infamy
    [1] "1941-12-07"
    >
    > today <- Sys.Date()
    > today
    [1] "2014-01-23"
    > today-infamy
    Time difference of 26345 days
    

    Finally, a nice function to know about and use is the format command. It can be used in a wide variety of situations, and not just for dates. It is helpful for dates, though, because you can use it in cat and print statements to make sure that your output is in exactly the form that you want.

    > theTime <- Sys.time()
    > theTime
    [1] "2014-01-23 16:15:05 EST"
    > a <- rexp(1,0.1)
    > a
    [1] 7.432072
    > cat("At about",format(theTime,"%H:%M"),"the time between occurances was around",format(a,digits=3),"seconds\n")
    At about 16:15 the time between occurances was around 7.43 seconds
    

    14.2. Time Operations

    The most difficult part of dealing with time data can be converting it into the right format. Once a time or date is stored in R’s internal format then a number of basic operations are available. The thing to keep in mind, though, is that the units you get after an operation can vary depending on the magnitude of the time values. Be very careful when dealing with time operations and vigorously test your codes.

    > now <- Sys.time()
    > now
    [1] "2014-01-23 16:31:00 EST"
    > now-60
    [1] "2014-01-23 16:30:00 EST"
    >
    > earlier <- strptime("2000-01-01 00:00:00","%Y-%m-%d %H:%M:%S")
    > later <- strptime("2000-01-01 00:00:20","%Y-%m-%d %H:%M:%S")
    > later-earlier
    Time difference of 20 secs
    > as.double(later-earlier)
    [1] 20
    >
    > earlier <- strptime("2000-01-01 00:00:00","%Y-%m-%d %H:%M:%S")
    > later <- strptime("2000-01-01 01:00:00","%Y-%m-%d %H:%M:%S")
    > later-earlier
    Time difference of 1 hours
    > as.double(later-earlier)
    [1] 1
    >
    > up <- as.Date("1961-08-13")
    > down <- as.Date("1989-11-9")
    > down-up
    Time difference of 10315 days
    

    The two examples involving the variables earlier and later in the previous code sample should cause you a little concern. The value of the difference depends on the largest units with respect to the difference! The issue is that when you subtract dates R uses the equivalent of the difftime command. We need to know how this operates to reduce the ambiguity when comparing times.

    > help(difftime)
    >
    > earlier <- strptime("2000-01-01 00:00:00","%Y-%m-%d %H:%M:%S")
    > later <- strptime("2000-01-01 01:00:00","%Y-%m-%d %H:%M:%S")
    > difftime(later,earlier)
    Time difference of 1 hours
    > difftime(later,earlier,units="secs")
    Time difference of 3600 secs
    

    One thing to be careful about with difftime is that it is a double precision number, but it has units attached to it. This can be tricky, and you should be careful about the ambiguity in using this command. I personally always try to specify the units to avoid this.

    > earlier <- strptime("2000-01-01 00:00:00","%Y-%m-%d %H:%M:%S")
    > later <- strptime("2000-01-01 00:00:20","%Y-%m-%d %H:%M:%S")
    > d <- difftime(later,earlier)
    > d
    Time difference of 20 secs
    > typeof(d)
    [1] "double"
    > as.double(d)
    [1] 20
    

    Another way to define a time difference is to use the as.difftime command. It takes a time span given as a string, the format of the string, and the units to use, and converts it into a time difference. Note that in the following example R is able to figure out what the units are when making the calculation.

    > diff <- as.difftime("00:30:00","%H:%M:%S",units="hour")
    > diff
    Time difference of 0.5 hours
    > Sys.time()
    [1] "2014-01-23 16:45:39 EST"
    > Sys.time()+diff
    [1] "2014-01-23 17:15:41 EST"
    

    The last thing to mention is that once a time stamp is cast into one of R’s internal formats comparisons can be made in a natural way.

    > diff <- as.difftime("00:30:00","%H:%M:%S",units="hour")
    > now <- Sys.time()
    > later <- now + diff
    > now
    , .toc-backref[1] "2014-01-23 16:47:48 EST"
    > later
    [1] "2014-01-23 17:17:48 EST"
    >
    > if(now < later)
      {
         cat("there you go\n")
      }
    there you go
    

    15. Introduction to Programming

    Contents

    We look at running commands from a source file. We also include an overview of the different statements that are used for control-flow that determines which code is executed by the interpreter.

    In the next section the ways to execute the commands in a file using the source command are given. The remaining sections are used to list the various flow control options that are available in the R language definition. The language definition has a wide variety of control functions which can be found using the help command.

    > help(Control)
    >
    

    15.1.1. Executing the commands in a File

    A set of R commands can be saved in a file and then executed as if you had typed them in from the command line. The source command is used to read the file and execute the commands in the same sequence given in the file.

    > source('file.R')
    > help(source)
    >
    

    If you simply source the file, neither the commands nor their results are printed. This can be overridden using the echo, print.eval, and verbose options.

    Some examples are given assuming that a file, simpleEx.R, is in the current directory. The file is given below:

    # Define a variable.
    x <- rnorm(10)
    
    # calculate the mean of x and print out the results.
    mux = mean(x)
    cat("The mean of x is ",mean(x),"\n")
    
    # print out a summary of the results
    summary(x)
    cat("The summary of x is \n",summary(x),"\n")
    print(summary(x))
    

    The file also demonstrates the use of # to specify comments. Anything after the # is ignored. Also, the file demonstrates the use of cat and print to send results to the standard output. Note that the commands have options to send results to a file. Use help for more information.
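    For example, the cat command can send its output to a file rather than to the screen. This is a minimal sketch, and the file name results.txt is made up for the illustration:

    cat("The mean of x is ",mean(x),"\n",file="results.txt",append=TRUE)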

    The output for the different options can be found below:

    > source('simpleEx.R')
    The mean of x is  -0.4817475
    The summary of x is
    -2.24 -0.5342 -0.2862 -0.4817 -0.1973 0.4259
        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     -2.2400 -0.5342 -0.2862 -0.4817 -0.1973  0.4259
    >
    >
    >
    > source('simpleEx.R',echo=TRUE)
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    -2.32600 -0.69140 -0.06772 -0.13540  0.46820  1.69600
    >
    >
    >
    > source('simpleEx.R',print.eval=TRUE)
    The mean of x is  0.1230581
        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    -1.7020 -0.2833  0.1174  0.1231  0.9103  1.2220
    The summary of x is
    -1.702 -0.2833 0.1174 0.1231 0.9103 1.222
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    -1.7020 -0.2833  0.1174  0.1231  0.9103  1.2220
    >
    >
    >
    > source('simpleEx.R',print.eval=FALSE)
    The mean of x is  0.6279428
    The summary of x is
    -0.7334 -0.164 0.9335 0.6279 1.23 1.604
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    -0.7334 -0.1640  0.9335  0.6279  1.2300  1.6040
    >
    >
    >
    >
    > source('simpleEx.R',verbose=TRUE)
    'envir' chosen:<environment: R_GlobalEnv>
    encoding = "native.enc" chosen
    --> parsed 6 expressions; now eval(.)ing them:
    
    >>>> eval(expression_nr. 1 )
                     =================
    
    > # Define a variable.
    > x <- rnorm(10)
    curr.fun: symbol <-
     .. after ‘expression(x <- rnorm(10))’
    
    >>>> eval(expression_nr. 2 )
                     =================
    
    > # calculate the mean of x and print out the results.
    > mux = mean(x)
    curr.fun: symbol =
     .. after ‘expression(mux = mean(x))’
    
    >>>> eval(expression_nr. 3 )
                     =================
    
    > cat("The mean of x is ",mean(x),"\n")
    The mean of x is  -0.1090932
    curr.fun: symbol cat
     .. after ‘expression(cat("The mean of x is ",mean(x),"\n"))’
    
    >>>> eval(expression_nr. 4 )
                     =================
    
    > # print out a summary of the results
    > summary(x)
    curr.fun: symbol summary
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    -1.3820 -1.0550 -0.1995 -0.1091  0.6813  2.1050
     .. after ‘expression(summary(x))’
    
    >>>> eval(expression_nr. 5 )
                     =================
    
    > cat("The summary of x is \n",summary(x),&q, .toc-backrefuot;\n")
    The summary of x is
     -1.382 -1.055 -0.1995 -0.1091 0.6813 2.105
    curr.fun: symbol cat
     .. after ‘expression(cat("The summary of x is \n",summary(x),"\n"))’
    
    >>>> eval(expression_nr. 6 )
                     =================
    
    > print(summary(x))
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    -1.3820 -1.0550 -0.1995 -0.1091  0.6813  2.1050
    curr.fun: symbol print
     .. after ‘expression(print(summary(x)))’
    

    One common problem that occurs is that R may not know where to find a file.

    > source('notThere.R')
    Error in file(filename, "r", encoding = encoding) :
      cannot open the connection
    In addition: Warning message:
    In file(filename, "r", encoding = encoding) :
      cannot open file 'notThere.R': No such file or directory
    

    R will search the current working directory. You can see what files are in the directory using the dir command, and you can determine the current directory using the getwd command.

    > getwd()
    [1] "/home/black/public_html/tutorial/R/rst/source/R"
    > dir()
    [1] "plotting.rData" "power.R"        "shadedRegion.R"
    

    You can change the current directory, and the options available depend on how you are using R. For example on a Windows PC or a Macintosh you can use the menu options to change the working directory. You can choose the directory using a graphical file browser. Otherwise, you can change to the correct directory before running R or use the setwd command.
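    For example, the working directory can be changed from the command line with the setwd command. The directory used below is just an example:

    > setwd("/home/black/public_html/tutorial")
    > getwd()
    [1] "/home/black/public_html/tutorial"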

    15.1.2. if statements

    Conditional execution is available using the if statement and the corresponding else statement.

    > x = 0.1
    > if( x < 0.2)
      {
         x <- x + 1
         cat("increment that number!\n")
      }
    increment that number!
    > x
    [1] 1.1
    

    The else statement can be used to specify an alternate option. In the example below note that the else statement must be on the same line as the ending brace for the previous if block.

    > x = 2.0
    > if ( x < 0.2)
     {
        x <- x + 1
        cat("increment that number!\n")
     } else
     {
        x <- x - 1
        cat("nah, make it smaller.\n");
     }
    nah, make it smaller.
    > x
    [1] 1
    

    Finally, the if statements can be chained together for multiple options. The if statement is considered a single code block, so more if statements can be added after the else.

    > x = 1.0
    > if ( x < 0.2)
     {
        x <- x + 1
        cat("increment that number!\n")
     } else if ( x < 2.0)
     {
       x <- 2.0*x
       cat("not big enough!\n")
     } else
     {
        x <- x - 1
        cat("nah, make it smaller.\n");
     }
    not big enough!
    > x
    [1] 2
    

    The argument to the if statement is a logical expression. A full list of logical operators can be found in the types document focusing on logical variables (Logical).

    15.1.3. for statements

    The for loop can be used to repeat a set of instructions, and it is used when you know in advance the values that the loop variable will have each time it goes through the loop. The basic format for the for loop is for(var in seq) expr

    An example is given below:

    > for (lupe in seq(0,1,by=0.3))
     {
        cat(lupe,"\n");
     }
    0
    0.3
    0.6
    0.9
    >
    > x <- c(1,2,4,8,16)
    > for (loop in x)
     {
        cat("value of loop: ",loop,"\n");
     }
    value of loop:  1
    value of loop:  2
    value of loop:  4
    value of loop:  8
    value of loop:  16
    

    See the section on breaks for more options (break and next statements)

    15.1.4. while statements

    The while loop can be used to repeat a set of instructions, and it is often used when you do not know in advance how often the instructions will be executed. The basic format for a while loop is while(cond) expr

    >
    > lupe <- 1;
    > x <- 1
    > while(x < 4)
     {
        x <- rnorm(1,mean=2,sd=3)
        cat("trying this value: ",x," (",lupe," times in loop)\n");
        lupe <- lupe + 1
     }
    trying this value:  -4.163169  ( 1  times in loop)
    trying this value:  3.061946  ( 2  times in loop)
    trying this value:  2.10693  ( 3  times in loop)
    trying this value:  -2.06527  ( 4  times in loop)
    trying this value:  0.8873237  ( 5  times in loop)
    trying this value:  3.145076  ( 6  times in loop)
    trying this value:  4.504809  ( 7  times in loop)
    

    See the section on breaks for more options (break and next statements)

    15.1.5. repeat statements

    The repeat loop is similar to the while loop. The difference is that it will always begin the loop the first time. The while loop will only start the loop if the condition is true the first time it is evaluated. Another difference is that you have to explicitly specify when to stop the loop using the break command.

    That is, you need to execute the break statement to get out of the loop.

    >  repeat
    {
        x <- rnorm(1)
        if(x < -2.0) break
    }
    > x
    [1] -2.300532
    

    See the section on breaks for more options (break and next statements)

    15.1.6. break and next statements

    The break statement is used to stop the execution of the current loop. It will break out of the current loop. The next statement is used to skip the statements that follow and restart the current loop. If a for loop is used then the next statement will update the loop variable.

    > x <- rnorm(5)
    > x
    [1]  1.41699338  2.28086759 -0.01571884  0.56578443  0.60400784
    > for(lupe in x)
     {
         if (lupe > 2.0)
             next
    
         if( (lupe<0.6) && (lupe > 0.5))
            break
    
        cat("The value of lupe is ",lupe,"\n");
     }
    The value of lupe is  1.416993
    The value of lupe is  -0.01571884
    

    15.1.7. switch statement

    The switch statement takes an expression and returns a value from a list based on the value of the expression. How it does this depends on the data type of the expression. The basic syntax is switch(statement,item1,item2,item3,...,itemN).

    If the result of the expression is a number then it returns the item in the list with the same index. Note that the expression is cast as an integer if it is not an integer.

    > x <- as.integer(2)
    > x
    [1] 2
    > z = switch(x,1,2,3,4,5)
    > z
    [1] 2
    > x <- 3.5
    > z = switch(x,1,2,3,4,5)
    > z
    [1] 3
    

    If the result of the expression is a string, then the list of items should be in the form “valueN”=resultN, and the statement will return the result that matches the value.

    > y <- rnorm(5)
    > y
    [1]  0.4218635 -0.8205637 -1.0191267 -0.6080061 -0.6079133
    > x <- "sd"
    > z <- switch(x,"mean"=mean(y),"median"=median(y),"variance"=var(y),"sd"=sd(y))
    > z
    [1] 0.5571847
    > x <- "median"
    > z <- switch(x,"mean"=mean(y),"median"=median(y),"variance"=var(y),"sd"=sd(y))
    > z
    [1] -0.6080061
    

    15.1.8. scan statement

    The command to read input from the keyboard is the scan statement. It has a wide variety of options and can be fine tuned to your specific needs. We only look at the basics here. The scan statement waits for input from a user, and it returns the value that was typed in.

    When using the command with no set number of lines the command will continue to read keyboard input until a blank line is entered.

    > help(scan)
    > a <- scan(what=double(0))
    1: 3.5
    2:
    Read 1 item
    > a
    [1] 3.5
    > typeof(a)
    [1] "double"
    >
    > a <- scan(what=double(0))
    1: yo!
    1:
    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
    scan() expected 'a real', got 'yo!'
    

    If you wish to only have it read from a fixed number of lines the nmax option can specify how many lines can be typed in, and the multi.line option can be used to turn off multi-line entry.

    > a <-  scan(what=double(0),nmax=1,multi.line = FALSE)
    1: 6.7
    Read 1 item
    > a
    [1] 6.7
    

    15.2. Functions

    A shallow overview of defining functions is given here. A few subtleties will be noted, but R can be a little quirky with respect to defining functions. The first bit of oddness is that you can think of a function as an object where you define the function and assign it to a variable name.

    To define a function you assign it to a name, and the keyword function is used to denote the start of the function and its argument list.

    > newDef <- function(a,b)
     {
         x = runif(10,a,b)
         mean(x)
     }
    > newDef(-1,1)
    [1] 0.06177728
    > newDef
    function(a,b)
    {
       x = runif(10,a,b)
       mean(x)
    }
    

    The last expression in the function is what is returned. So in the example above the sample mean of the numbers is returned.

    > x <- newDef(0,1)
    > x
    [1] 0.4800442
    

    The arguments that are passed are matched in order. They can be specified explicitly, though.

    > newDef(b=10,a=1)
    [1] 4.747509
    > newDef(10,1)
    [1] NaN
    Warning message:
    In runif(10, a, b) : NAs produced
    

    You can mix this approach, and R will try to match up the named arguments and then match the rest going from left to right. Another bit of weirdness is that R will not evaluate an expression in the argument list until the moment it is needed in the function (this is known as lazy evaluation). This is a different kind of behavior than what most people are used to, so be very careful about it. The best rule of thumb is not to put operations in an argument list if their side effects matter after the function is called.
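    To make the lazy evaluation behavior concrete, here is a minimal sketch. The function name sideEffect and the variable x are made up for this illustration:

    > sideEffect <- function(a,b)
     {
        a + 1                       # b is never used, so the expression passed for b is never evaluated
     }
    > x <- 5
    > sideEffect(2,{x <- 100; x})   # the braces that would change x are never executed
    [1] 3
    > x
    [1] 5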

    Another common task is to have a function return multiple items. This can be accomplished by returning a list of items. The objects within a list can be accessed using the same $ notation that is used for data frames.

    > c = c(1,2,3,4,5)
    > sample <- function(a,b)
    {
      value = switch(a,"median"=median(b),"mean"=mean(b),"variance"=var(b))
      largeVals = length(c[c>1])
      list(stat=value,number=largeVals)
    }
    > result <- sample("median",c)
    > result
    $stat
    [1] 3
    
    $number
    [1] 4
    
    > result$stat
    [1] 3
    > result$number
    [1] 4
    

    There is another potential problem that can occur when using a function in R. When it comes to determining the value of a variable there is a path that R will use to search for its value. In the case of functions if a previously undefined variable appears R will look at the argument list for the function. Next it will look in the current work space. If you are not careful R will find the value some place where you do not expect it, and your function will return a value that is not correct, and no error will be given. Be very careful about the names of variables especially when using functions.
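    A small sketch of this behavior is given below; the names total and addTotal are made up for the illustration:

    > total <- 100
    > addTotal <- function(a)
     {
        a + total                   # total is not an argument, so R silently finds it in the work space
     }
    > addTotal(1)
    [1] 101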

    Object Oriented Programming

    There are at least three different approaches to object oriented programming in R. We examine two of them, the S3 and S4 classes. The other approach makes use of a package, and we focus instead on the two built-in classes.

    1. S3 Classes

    Contents

    We examine how to create S3 classes. It is assumed that you are familiar with the basic data types and scripting (Introduction to Programming).

    1.1. The Basic Idea

    First, everything in R is treated as an object. We have seen this with functions. Many of the objects that are created within an R session have attributes associated with them. One common attribute associated with an object is its class.

    You can set the class attribute using the class command. One thing to notice is that the class is a vector which allows an object to inherit from multiple classes, and it allows you to specify the order of inheritance for complex classes. You can also use the class command to determine the classes associated with an object.

    > bubba <- c(1,2,3)
    > bubba
    [1] 1 2 3
    >
    > class(bubba)
    [1] "numeric"
    >
    > class(bubba) <- append(class(bubba),"Flamboyancy")
    > class(bubba)
    [1] "numeric"     "Flamboyancy"
    

    Note

    A new command, append, is used here. The first argument is a vector, and the function adds the following arguments to the end of the vector.
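    For example:

    > append(c(1,2,3),c(10,20))
    [1]  1  2  3 10 20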

    One way to define a method for a class is to use the UseMethod command to define a hierarchy of functions that will react appropriately. The UseMethod command will tell R to look for a function whose prefix matches the current function, and it searches for a suffix in order from the vector of class names. In other words a set of functions can be defined, and the function that is actually called will be determined by the class name of the first object in the list of arguments.

    You first have to define a generic function to reserve the function name. The UseMethod command is then used to tell the R system to search for a different function. The search is based on the name of the function and the names of an object’s classes. The names of the functions have two parts separated by a “.” where the prefix is the function name and the suffix is the name of a class.

    That is a lot of verbiage to describe a relatively simple idea. A very basic example is given below:

    > bubba <- list(first="one", second="two", third="third")
    > class(bubba) <- append(class(bubba),"Flamboyancy")
    >
    > bubba
    $first
    [1] "one"
    
    $second
    [1] "two"
    
    $third
    [1] "third"
    
    attr(,"class")
    [1] "list"        "Flamboyancy"
    >
    > GetFirst <- function(x)
    + {
    +     UseMethod("GetFirst",x)
    + }
    >
    > GetFirst.Flamboyancy <- function(x)
    + {
    +    return(x$first)
    + }
    >
    > GetFirst(bubba)
    [1] "one"
    

    1.2. Memory Management

    The plethora of object oriented approaches leads to a natural question. Which one should you use? With respect to S3 and S4 classes, the S3 class is more flexible, and the S4 class is a more structured approach. This is a nice way of saying that the S3 class approach is for unaware slobs and is a sloppy way to shoot yourself in the foot, while the S4 class is for uptight pedants.

    Our focus here is on S3 classes. Before we delve into the details of S3 classes we need to talk about memory environments. These can be used to great effect in S3 classes to make your codes totally incomprehensible. On the down side they help give S3 classes their flexibility.

    An environment can be thought of as a local scope. It has a set of variables associated with it. You can access those variables if you have the “ID” associated with the environment. There are a number of commands you can use to manipulate and obtain the pointers to your environments. You can also use the assign and get commands to set and get the values of variables within an environment.

    The environment command can be used to get the pointer to the current environment.

    > ls()
    character(0)
    > e <- environment()
    > e
    <environment: R_GlobalEnv>
    > assign("bubba",3,e)
    > ls()
    [1] "bubba" "e"
    > bubba
    [1] 3
    > get("bubba",e)
    [1] 3
    

    Environments can be created and embedded within other environments and can be structured to form a hierarchy. There are a number of commands to help you move through different environments. You can find more details using the command help(environment), but we do not pursue more details because this is as much as we need for our purposes of using S3 classes.
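    For reference, a new environment can be created with the new.env command and used with assign and get in the same way. The names e2 and louise below are made up for this sketch:

    > e2 <- new.env()
    > assign("louise",42,envir=e2)
    > get("louise",envir=e2)
    [1] 42
    > ls(e2)
    [1] "louise"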

    1.3. Creating an S3 Class

    The basic ideas associated with S3 classes are discussed in the first section (The Basic Idea). We now expand on that idea and demonstrate how to define a function that will create and return an object of a given class. The basic idea is that a list is created with the relevant members, the list’s class is set, and a copy of the list is returned.

    Here we examine two different ways to construct an S3 class. The first approach is more commonly used and is more straightforward. It makes use of basic list properties. The second approach makes use of the local environment within a function to define the variables tracked by the class. The advantage to the second approach is that it looks more like the object oriented approach that many are familiar with. The disadvantage is that it is more difficult to read the code, and it is more like working with pointers which is different from the way other objects work in R.

    1.3.1. Straight Forward Approach

    The first approach is the more standard approach most often seen with S3 classes. It makes use of methods defined outside of the class, which are described below (Creating Methods). It also keeps track of the data maintained by the class using the standard practices associated with lists.

    The basic idea is that a function is defined which creates a list. The data entries tracked by the class are defined in the list. In the example below the defaults are specified in the argument list with default values assigned. A new class name is appended to the list’s classes, and the list is returned.

    NorthAmerican <- function(eatsBreakfast=TRUE,myFavorite="cereal")
    {
    
            me <- list(
                    hasBreakfast = eatsBreakfast,
                    favoriteBreakfast = myFavorite
           )
    
            ## Set the name for the class
            class(me) <- append(class(me),"NorthAmerican")
            return(me)
    }
    

    Once this definition is executed a new function is defined, called NorthAmerican. A new object of this class can be created by calling the function.

    > bubba <- NorthAmerican()
    > bubba
    $hasBreakfast
    [1] TRUE
    
    $favoriteBreakfast
    [1] "cereal"
    
    attr(,"class")
    [1] "list"          "NorthAmerican"
    > bubba$hasBreakfast
    [1] TRUE
    >
    > louise <- NorthAmerican(eatsBreakfast=TRUE,myFavorite="fried eggs")
    > louise
    $hasBreakfast
    [1] TRUE
    
    $favoriteBreakfast
    [1] "fried eggs"
    
    attr(,"class")
    [1] "list"          "NorthAmerican"
    

    1.3.2. Local Environment Approach

    Another approach can be employed that makes use of the local environment within a function to access the variables. When we define methods with this approach later (Local Environment Approach), the results will look more like the object oriented approaches seen in other languages.

    The approach relies on the local scope created when a function is called. A new environment is created that can be identified using the environment command. The environment can be saved in the list created for the class, and the variables within this scope can then be accessed using the identification of the environment.

    In the example below this approach appears to require more overhead. When we examine how to add external methods to the class the advantage will be a little clearer.

    NordAmericain <- function(eatsBreakfast=TRUE,myFavorite="cereal")
    {
    
          ## Get the environment for this
          ## instance of the function.
          thisEnv <- environment()
    
          hasBreakfast <- eatsBreakfast
          favoriteBreakfast <- myFavorite
    
          ## Create the list used to represent an
          ## object for this class
          me <- list(
    
            ## Define the environment where this list is defined so
            ## that I can refer to it later.
            thisEnv = thisEnv,
    
            ## The Methods for this class normally go here but are discussed
            ## below. A simple placeholder is here to give you a teaser....
            getEnv = function()
            {
                  return(get("thisEnv",thisEnv))
            }
    
            )
    
          ## Define the value of the list within the current environment.
          assign('this',me,envir=thisEnv)
    
          ## Set the name for the class
          class(me) <- append(class(me),"NordAmericain")
          return(me)
    }
    

    Now that the class is defined, the environment used for a given object can be easily retrieved.

    > bubba <- NordAmericain()
    > get("hasBreakfast",bubba$getEnv())
    [1] TRUE
    > get("favoriteBreakfast",bubba$getEnv())
    [1] "cereal"
    

    Note that there is an unfortunate side effect to this approach. By keeping track of the environment, it is similar to using a pointer to the variables rather than the variables themselves. This means when you make a copy, you are making a copy of the pointer to the environment.

    > bubba <- NordAmericain(myFavorite="oatmeal")
    > get("favoriteBreakfast",bubba$getEnv())
    [1] "oatmeal"
    > louise <- bubba
    > assign("favoriteBreakfast","toast",louise$getEnv())
    > get("favoriteBreakfast",louise$getEnv())
    [1] "toast"
    > get("favoriteBreakfast",bubba$getEnv())
    [1] "toast"
    

    This issue will be explored again in the subsection below detailing how to create methods for an S3 class. If you wish to be able to copy an object using this approach you need to create a new method to return a proper copy.

    1.4. Creating Methods

    We now explore how to create methods associated with a class. Again we break it up into two parts. The first approach is used for both approaches discussed in the previous section. If you make use of the local environment approach discussed above you will likely make use of both approaches discussed in this section. If you only make use of the more straight forward approach you only need to be aware of the first approach discussed here.

    1.4.1. Straight Forward Approach

    The first approach is to define a function that exists outside of the class. The function is defined in a generic way, and then a function specific to a given class is defined. The R environment then decides which function to use based on the class names of an argument to the function, and the suffix used in the names of the associated functions.

    One thing to keep in mind is that for assignment R makes copies of objects. The implication is that if you change a part of an object you need to return an exact copy of the object. Otherwise your changes may be lost.

    In the examples below we define accessors for the variables in the class defined above. We assume that the class NorthAmerican is defined in the same way as the first example above, Straightforward Class. In the first example the goal is that we want to create a function that will set the value of hasBreakfast for a given object. The name of the function will be setHasBreakfast.

    The first step is to reserve the name of the function, and use the UseMethod command to tell R to search for the correct function. If we pass an object whose class includes the name NorthAmerican then the correct function to call should be called setHasBreakfast.NorthAmerican. Note that we will also create a function called setHasBreakfast.default. This function will be called if R cannot find another function of the correct class.

    setHasBreakfast <- function(elObjeto, newValue)
            {
                    print("Calling the base setHasBreakfast function")
                    UseMethod("setHasBreakfast",elObjeto)
                    print("Note this is not executed!")
            }
    
    setHasBreakfast.default <- function(elObjeto, newValue)
            {
                    print("You screwed up. I do not know how to handle this object.")
                    return(elObjeto)
            }
    
    
    setHasBreakfast.NorthAmerican <- function(elObjeto, newValue)
            {
                    print("In setHasBreakfast.NorthAmerican and setting the value")
                    elObjeto$hasBreakfast <- newValue
                    return(elObjeto)
            }
    

    The first thing to note is that the function returns a copy of the object passed to it. R passes copies of objects to functions. If you change an object within a function it does not change the original object. You must pass back a copy of the updated object.
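    To see the copy behavior on its own, here is a minimal sketch; the names changeFirst, original, and changed are made up for this illustration:

    > changeFirst <- function(v)
     {
        v[1] <- 99                  # changes the local copy only
        return(v)
     }
    > original <- c(1,2,3)
    > changed <- changeFirst(original)
    > original
    [1] 1 2 3
    > changed
    [1] 99  2  3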

    > bubba <- NorthAmerican()
    > bubba$hasBreakfast
    [1] TRUE
    > bubba <- setHasBreakfast(bubba,FALSE)
    [1] "Calling the base setHasBreakfast function"
    [1] "In setHasBreakfast.NorthAmerican and setting the value"
    > bubba$hasBreakfast
    [1] FALSE
    > bubba <- setHasBreakfast(bubba,"No type checking sucker!")
    [1] "Calling the base setHasBreakfast function"
    [1] "In setHasBreakfast.NorthAmerican and setting the value"
    > bubba$hasBreakfast
    [1] "No type checking sucker!"
    

    If the correct function cannot be found then the default version of the function is called.

    > someNumbers <- 1:4
    > someNumbers
    [1] 1 2 3 4
    > someNumbers <- setHasBreakfast(someNumbers,"what?")
    [1] "Calling the base setHasBreakfast function"
    [1] "You screwed up. I do not know how to handle this object."
    > someNumbers
    [1] 1 2 3 4
    

    It is a good practice to only use predefined accessors to get and set values held by an object. As a matter of completeness we define methods to get the value of the hasBreakfast field.

    getHasBreakfast <- function(elObjeto)
            {
                    print("Calling the base getHasBreakfast function")
                    UseMethod("getHasBreakfast",elObjeto)
                    print("Note this is not executed!")
            }
    
    getHasBreakfast.default <- function(elObjeto)
            {
                    print("You screwed up. I do not know how to handle this object.")
                    return(NULL)
            }
    
    
    getHasBreakfast.NorthAmerican <- function(elObjeto)
            {
                    print("In getHasBreakfast.NorthAmerican and returning the value")
                    return(elObjeto$hasBreakfast)
            }
    

    The functions to get the values are used in the same way.

    > bubba <- NorthAmerican()
    > bubba <- setHasBreakfast(bubba,"No type checking sucker!")
    [1] "Calling the base setHasBreakfast function"
    [1] "In setHasBreakfast.NorthAmerican and setting the value"
    > result <- getHasBreakfast(bubba)
    [1] "Calling the base getHasBreakfast function"
    [1] "In getHasBreakfast.NorthAmerican and returning the value"
    > result
    [1] "No type checking sucker!"
    

    1.4.2. Local Environment Approach

    If the second method for defining an S3 class is used, as seen above in Local Environment Class, then there is an additional way to define a method. In this approach functions can be defined within the list that represents the object.

    NordAmericain <- function(eatsBreakfast=TRUE,myFavorite="cereal")
    {
    
          ## Get the environment for this
          ## instance of the function.
          thisEnv <- environment()
    
          hasBreakfast <- eatsBreakfast
          favoriteBreakfast <- myFavorite
    
          ## Create the list used to represent an
          ## object for this class
          me <- list(
    
                  ## Define the environment where this list is defined so
                  ## that I can refer to it later.
                  thisEnv = thisEnv,
    
                  ## Define the accessors for the data fields.
                  getEnv = function()
                  {
                          return(get("thisEnv",thisEnv))
                  },
    
                  getHasBreakfast = function()
                  {
                          return(get("hasBreakfast",thisEnv))
                  },
    
                  setHasBreakfast = function(value)
                  {
                          return(assign("hasBreakfast",value,thisEnv))
                  },
    
    
                  getFavoriteBreakfast = function()
                  {
                          return(get("favoriteBreakfast",thisEnv))
                  },
    
                  setFavoriteBreakfast = function(value)
                  {
                          return(assign("favoriteBreakfast",value,thisEnv))
                  }
    
            )
    
          ## Define the value of the list within the current environment.
          assign('this',me,envir=thisEnv)
    
          ## Set the name for the class
          class(me) <- append(class(me),"NordAmericain")
          return(me)
    }
    

    With this definition the methods can be called in a more direct manner.

    > bubba <- NordAmericain(myFavorite="oatmeal")
    > bubba$getFavoriteBreakfast()
    [1] "oatmeal"
    > bubba$setFavoriteBreakfast("plain toast")
    > bubba$getFavoriteBreakfast()
    [1] "plain toast"
    

    As noted above, Local Environment Class, this approach can be problematic when making a copy of an object. If you need to make copies of your objects a function must be defined to explicitly make a copy.
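
    To see why, here is a brief sketch of the problem, assuming the NordAmericain definition above (the variable names are only for illustration). Assignment with “<-” copies the list, but both copies still refer to the same environment, so they share their data. The makeCopy method defined next avoids this by building a brand new object:

    bubba <- NordAmericain(myFavorite="oatmeal")
    gerard <- bubba                       # looks like an independent copy...
    gerard$setFavoriteBreakfast("eggs")   # ...but changing gerard
    bubba$getFavoriteBreakfast()          # also changes bubba: returns "eggs"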

    makeCopy <- function(elObjeto)
            {
                    print("Calling the base makeCopy function")
                    UseMethod("makeCopy",elObjeto)
                    print("Note this is not executed!")
            }
    
    makeCopy.default <- function(elObjeto)
            {
                    print("You screwed up. I do not know how to handle this object.")
                    return(elObjeto)
            }
    
    
    makeCopy.NordAmericain <- function(elObjeto)
            {
                    print("In makeCopy.NordAmericain and making a copy")
                    newObject <- NordAmericain(
                            eatsBreakfast=elObjeto$getHasBreakfast(),
                            myFavorite=elObjeto$getFavoriteBreakfast())
                    return(newObject)
            }
    

    With this definition we can now make a proper copy of the object and get the expected results.

    > bubba <- NordAmericain(eatsBreakfast=FALSE,myFavorite="oatmeal")
    > louise <- makeCopy(bubba)
    [1] "Calling the base makeCopy function"
    [1] "In makeCopy.NordAmericain and making a copy"
    > louise$getFavoriteBreakfast()
    [1] "oatmeal"
    > louise$setFavoriteBreakfast("eggs")
    > louise$getFavoriteBreakfast()
    [1] "eggs"
    > bubba$getFavoriteBreakfast()
    [1] "oatmeal"
    

    1.5. Inheritance

    Inheritance is part of what makes it worthwhile to go to the effort of making up a proper class. The basic idea is that another class can be constructed that makes use of all the data and methods of a base class and builds on them by adding additional data and methods.

    The basic idea is that an object’s class is a vector that contains an ordered list of classes that an object is a member of. When a new object is created it can add its class name to that list. The methods associated with the class can use the NextMethod command to search for the function associated with the next class in the list.

    In the examples below we build on the NorthAmerican class defined above. The examples assume that the NorthAmerican class is already defined, and the class hierarchy is shown in Figure 1.

    Figure 1. Diagram of the NorthAmerican derived classes.

    A file with the full script is available at s3Inheritance.R. We do not provide the full code listing here in order to keep the discussion somewhat under control. Our first step is to define the new classes.

    Mexican <- function(eatsBreakfast=TRUE,myFavorite="los huevos")
    {
    
            me <- NorthAmerican(eatsBreakfast,myFavorite)
    
            ## Add the name for the class
            class(me) <- append(class(me),"Mexican")
            return(me)
    }
    
    
    USAsian <- function(eatsBreakfast=TRUE,myFavorite="pork belly")
    {
    
            me <- NorthAmerican(eatsBreakfast,myFavorite)
    
            ## Add the name for the class
            class(me) <- append(class(me),"USAsian")
            return(me)
    }
    
    Canadian <- function(eatsBreakfast=TRUE,myFavorite="back bacon")
    {
    
            me <- NorthAmerican(eatsBreakfast,myFavorite)
    
            ## Add the name for the class
            class(me) <- append(class(me),"Canadian")
            return(me)
    }
    
    Anglophone <- function(eatsBreakfast=TRUE,myFavorite="pancakes")
    {
    
            me <- Canadian(eatsBreakfast,myFavorite)
    
            ## Add the name for the class
            class(me) <- append(class(me),"Anglophone")
            return(me)
    }
    
    Francophone <- function(eatsBreakfast=TRUE,myFavorite="crepes")
    {
    
            me <- Canadian(eatsBreakfast,myFavorite)
    
            ## Add the name for the class
            class(me) <- append(class(me),"Francophone")
            return(me)
    }
    

    With these definitions we can define an object. In this case we create an object from the Francophone class.

    > francois <- Francophone()
    > francois
    $hasBreakfast
    [1] TRUE
    
    $favoriteBreakfast
    [1] "crepes"
    
    attr(,"class")
    [1] "list"          "NorthAmerican" "Canadian"      "Francophone"
    

    The thing to notice is that the class vector demonstrates the class inheritance structure. We can now define a method, makeBreakfast, which will make use of the class structure.

    makeBreakfast <- function(theObject)
            {
                    print("Calling the base makeBreakfast function")
                    UseMethod("makeBreakfast",theObject)
            }
    
    makeBreakfast.default <- function(theObject)
            {
                    print(noquote(paste("Well, this is awkward. Just make",
                                      getFavoriteBreakfast(theObject))))
                    return(theObject)
            }
    
    makeBreakfast.Mexican <- function(theObject)
            {
                    print(noquote(paste("Estoy cocinando",
                                           getFavoriteBreakfast(theObject))))
                    NextMethod("makeBreakfast",theObject)
                    return(theObject)
            }
    
    makeBreakfast.USAsian <- function(theObject)
            {
                    print(noquote(paste("Leave me alone I am making",
                                                   getFavoriteBreakfast(theObject))))
                    NextMethod("makeBreakfast",theObject)
                    return(theObject)
            }
    
    makeBreakfast.Canadian <- function(theObject)
            {
                    print(noquote(paste("Good morning, how would you like",
                                           getFavoriteBreakfast(theObject))))
                    NextMethod("makeBreakfast",theObject)
                    return(theObject)
            }
    
    makeBreakfast.Anglophone <- function(theObject)
            {
                    print(noquote(paste("I hope it is okay that I am making",
                                           getFavoriteBreakfast(theObject))))
                    NextMethod("makeBreakfast",the, .toc-backrefObject)
                    return(theObject)
            }
    
    makeBreakfast.Francophone <- function(theObject)
            {
                    print(noquote(paste("Je cuisine",
                                           getFavoriteBreakfast(theObject))))
                    NextMethod("makeBreakfast",theObject)
                    return(theObject)
            }
    

    Note that each of the class-specific methods calls NextMethod to invoke the function associated with the next class in the list.

    > francois <- makeBreakfast(francois)
    [1] "Calling the base makeBreakfast function"
    [1] Good morning, how would you like crepes
    [1] Je cuisine crepes
    [1] Well, this is awkward. Just make crepes
    

    It is important to note the order in which the methods are called: they are called from left to right in the list of classes. Another thing to note is that when the methods run out of class names the default function is called.
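
    For comparison, here is a sketch of what to expect for an object from the Mexican class, assuming the definitions above are loaded (the variable name pedro is only for illustration, and the exact print statements depend on how the accessors are defined in the full script). The class vector is "list", "NorthAmerican", "Mexican", so the Mexican method runs first, and NextMethod then falls through to the default version because no other class in the vector has a makeBreakfast method:

    pedro <- Mexican()
    pedro <- makeBreakfast(pedro)
    ## Expect roughly: the base message, then "Estoy cocinando los huevos",
    ## then the default "Well, this is awkward. Just make los huevos"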

    2. S4 Classes

    Contents

    We examine how to create S4 classes. It is assumed that you are familiar with the basic data types and scripting (Introduction to Programming).

    2.1. The Basic Idea

    The S4 approach differs from the S3 approach in that it provides a more rigid class definition. The idea is that a class is defined using the setClass command. The command takes a number of options. Many of the options are not required, but we make use of several of the optional arguments because they represent good practices with respect to object oriented programming.

    We first construct a trivial, contrived class simply to demonstrate the basic idea. Next we demonstrate how to create a method for an S4 class. This example is a little more involved than what we saw in the section on S3 classes.

    In this example, the name of the class is FirstQuadrant, and the class is used to keep track of an (x,y) coordinate pair in the first quadrant. There is a restriction that both values must be greater than or equal to zero. There are two data elements, called slots, named x and y. The default values for the coordinate are the origin, x=0 and y=0.

    ######################################################################
    # Create the first quadrant class
    #
    # This is used to represent a coordinate in the first quadrant.
    FirstQuadrant <- setClass(
            # Set the name for the class
            "FirstQuadrant",
    
            # Define the slots
            slots = c(
                    x = "numeric",
                    y = "numeric"
                    ),
    
            # Set the default values for the slots. (optional)
            prototype=list(
                    x = 0.0,
                    y = 0.0
                    ),
    
            # Make a function that can test to see if the data is consistent.
            # This is not called if you have an initialize function defined!
            validity=function(object)
            {
                    if((object@x < 0) || (object@y < 0)) {
                            return("A negative number for one of the coordinates was given.")
                    }
                    return(TRUE)
            }
            )
    

    Note that the way to access one of the data elements is to use the “@” symbol. An example is given below. In the example three objects of the class defined above are created. The first uses the default values for the slots, the second overrides the defaults, and finally an attempt is made to create a coordinate in the second quadrant.

    > x <- FirstQuadrant()
    > x
    An object of class "FirstQuadrant"
    Slot "x":
    [1] 0
    
    Slot "y":
    [1] 0
    
    > y <- FirstQuadrant(x=5,y=7)
    > y
    An object of class "FirstQuadrant"
    Slot "x":
    [1] 5
    
    Slot "y":
    [1] 7
    
    > y@x
    [1] 5
    > y@y
    [1] 7
    > z <- FirstQuadrant(x=3,y=-2)
    Error in validObject(.Object) :
          invalid class “FirstQuadrant” object: A negative number for one of the coordinates was given.
    > z
    Error: object 'z' not found
    

    In the next example we create a method that is associated with the class. The method is used to set the values of a coordinate. The first step is to reserve the name using the setGeneric command, and then the setMethod command is used to define the function to be called when the first argument is an object from the FirstQuadrant class.

    # create a method to assign the value of a coordinate
    setGeneric(name="setCoordinate",
                           def=function(theObject,xVal,yVal)
                           {
                                   standardGeneric("setCoordinate")
                           }
                           )
    
    setMethod(f="setCoordinate",
                          signature="FirstQuadrant",
                          definition=function(theObject,xVal,yVal)
                          {
                                  theObject@x <- xVal
                                  theObject@y <- yVal
                                  return(theObject)
                          }
                          )
    

    It is important to note that R generally passes objects as values. For this reason the methods defined above return the updated object. When the method is called, it is used to replace the former object with the updated object.

    > z <- FirstQuadrant(x=2.5,y=10)
    > z
    An object of class "FirstQuadrant"
    Slot "x":
    [1] 2.5
    
    Slot "y":
    [1] 10
    
    > z <- setCoordinate(z,-3.0,-5.0)
    > z
    An object of class "FirstQuadrant"
    Slot "x":
    [1] -3
    
    Slot "y":
    [1] -5
    
    

    Note that the validity function given in the original class definition is not called here. It is only called when an object is first created, or later when an explicit request is made using the validObject command.
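
    For example, continuing with the z defined above, which now holds a negative coordinate, an explicit call to validObject flags the problem. The output should look something like the following:

    > validObject(z)
    Error in validObject(z) :
          invalid class “FirstQuadrant” object: A negative number for one of the coordinates was given.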

    2.2. Creating an S4 Class

    An S4 class is created using the setClass() command. At a minimum the name of the class and the names of the data elements (slots) are specified. There are a number of other options, and just as a matter of good practice we also specify a function to verify that the data is consistent (validation), and we specify the default values (the prototype). In the last section of this page, S4 inheritance, we include an additional parameter used to specify a class hierarchy.

    In this section we look at another example, and we examine some of the functions associated with S4 classes. The example we define will be used to motivate the use of methods associated with a class, and it will be used to demonstrate inheritance later. The idea is that we want to create a program to simulate a cellular automata model of a predator-prey system.

    We do not develop the whole code here but concentrate on the data structures. In particular we will create a base class for the agents. In the next section we will create the basic methods for the class. In the inheritance section we will discuss how to build on the class to create different predators and different prey species. The basic structure of the class is shown in Figure 1.

    Figure 1. Diagram of the base class, Agent, used for the agents in a simulation.

    The methods for this class are defined in the following section. Here we define the class and its slots, and the code to define the class is given below:

    ######################################################################
    # Create the base Agent class
    #
    # This is used to represent the most basic agent in a simulation.
    Agent <- setClass(
            # Set the name for the class
            "Agent",
    
            # Define the slots
            slots = c(
                    location = "numeric",
                    velocity   = "numeric",
                    active   = "logical"
                    ),
    
            # Set the default values for the slots. (optional)
            prototype=list(
                    location = c(0.0,0.0),
                    active   = TRUE,
                    velocity = c(0.0,0.0)
                    ),
    
            # Make a function that can test to see if the data is consistent.
            # This is not called if you have an initialize function defined!
            validity=function(object)
            {
                    if(sum(object@velocity^2)>100.0) {
                            return("The velocity level is out of bounds.")
                    }
                    return(TRUE)
            }
            )
    

    Now that the code to define the class is given we can create an object whose class is Agent.

    > a <- Agent()
    > a
    An object of class "Agent"
    Slot "location":
    [1] 0 0
    
    Slot "velocity":
    [1] 0 0
    
    Slot "active":
    [1] TRUE
    
    

    Before we define the methods for the class a number of additional commands are explored. The first functions explored are the is.object and isS4 commands. The is.object command determines whether or not a variable refers to an object. The isS4 command determines whether or not the variable is an S4 object. The reason both are required is that the isS4 command alone cannot determine whether a variable is an S3 object. You need to determine if the variable is an object and then decide whether it is S4 or not.

    > is.object(a)
    [1] TRUE
    > isS4(a)
    [1] TRUE
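
    For contrast, here is a quick sketch with an S3 object, assuming the NorthAmerican constructor from the S3 section is still defined. It is an object, but it is not an S4 object:

    > bubba <- NorthAmerican()
    > is.object(bubba)
    [1] TRUE
    > isS4(bubba)
    [1] FALSE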
    

    The next set of commands are used to get information about the data elements, or slots, within an object. The first is the slotNames command. This command can take either an object or the name of a class. It returns the names of the slots associated with the class as strings.

    > slotNames(a)
    [1] "location" "velocity" "active"
    > slotNames("Agent")
    [1] "location" "velocity" "active"
    

    The getSlots command is similar to the slotNames command. It takes the name of a class as a string. It returns a vector whose entries are the types associated with the slots, and the names of the entries are the names of the slots.

    > getSlots("Agent")
         location  velocity    active
    "numeric" "numeric" "logical"
    > s <- getSlots("Agent")
    > s[1]
         location
    "numeric"
    > s[[1]]
    [1] "numeric"
    > names(s)
    [1] "location" "velocity" "active"
    

    The next command examined is the getClass command. It has two forms. If you give it a variable that holds an S4 object it displays the object's slots and their current values. If you give it a character string with the name of a class it gives the slots and their data types.

    > getClass(a)
    An object of class "Agent"
    Slot "location":
    [1] 0 0
    
    Slot "velocity":
    [1] 0 0
    
    Slot "active":
    [1] TRUE
    
    > getClass("Agent")
    Class "Agent" [in ".GlobalEnv"]
    
    Slots:
    
    Name:  location velocity   active
    Class:  numeric  numeric  logical
    

    The final command examined is the slot command. It can be used to get or set the value of a slot in an object. It can be used in place of the “@” operator.

    > slot(a,"location")
    [1] 0 0
    > slot(a,"location") <- c(1,5)
    > a
    An object of class "Agent"
    Slot "location":
    [1] 1 5
    
    Slot "velocity":
    [1] 0 0
    
    Slot "active":
    [1] TRUE
    
    

    2.3. Creating Methods

    We now build on the Agent class defined above. Once the class and its data elements are defined we can define the methods associated with the class. The basic idea is that if the name of a function has not yet been defined, the name must first be reserved using the setGeneric function. The setMethod command can then be used to define which function is called based on the class names of the objects sent to it.

    We define the methods associated with the Agent class given in the previous section. Note that the validity function for an object is only called when it is first created and when an explicit call to the validObject function is made. We make use of the validObject command in the methods below that are used to change the value of a data element within an object.

    # create a method to assign the value of the location
    setGeneric(name="setLocation",
                           def=function(theObject,position)
                           {
                                   standardGeneric("setLocation")
                           }
                           )
    
    setMethod(f="setLocation",
                          signature="Agent",
                          definition=function(theObject,position)
                          {
                                  theObject@location <- position
                                  validObject(theObject)
                                  return(theObject)
                          }
                          )
    
    # create a method to get the value of the location
    setGeneric(name="getLocation",
                           def=function(theObject)
                           {
                                   standardGeneric("getLocation")
                           }
                           )
    
    setMethod(f="getLocation",
                          signature="Agent",
                          definition=function(theObject)
                          {
                                  return(theObject@location)
                          }
                          )
    
    
    # create a method to assign the value of active
    setGeneric(name="setActive",
                           def=function(theObject,active)
                           {
                                   standardGeneric("setActive")
                           }
                           )
    
    setMethod(f="setActive",
                          signature="Agent",
                          definition=function(theObject,active)
                          {
                                  theObject@active <- active
                                  validObject(theObject)
                                  return(theObject)
                          }
                          )
    
    # create a method to get the value of active
    setGeneric(name="getActive",
                           def=function(theObject)
                           {
                                   standardGeneric("getActive")
                           }
                           )
    
    setMethod(f="getActive",
                          signature="Agent",
                          definition=function(theObject)
                          {
                                  return(theObject@active)
                          }
                          )
    
    
    # create a method to assign the value of velocity
    setGeneric(name="setVelocity",
                           def=function(theObject,velocity)
                           {
                                   standardGeneric("setVelocity")
                           }
                           )
    
    setMethod(f="setVelocity",
                          signature="Agent",
                          definition=function(theObject,velocity)
                          {
                                  theObject@velocity <- velocity
                                  validObject(theObject)
                                  return(theObject)
                          }
                          )
    
    # create a method to get the value of the velocity
    setGeneric(name="getVelocity",
                           def=function(theObject)
                           {
                                   standardGeneric("getVelocity")
                           }
                           )
    
    setMethod(f="getVelocity",
                          signature="Agent",
                          definition=function(theObject)
                          {
                                  return(theObject@velocity)
                          }
                          )
    

    With these definitions the data elements are encapsulated and can be accessed and set using the methods given above. It is generally good practice in object oriented programming to keep your data private and not show them to everybody willy nilly.

    > a <- Agent()
    > getVelocity(a)
    [1] 0 0
    > a <- setVelocity(a,c(1.0,2.0))
    > getVelocity(a)
    [1] 1 2
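
    If you want to check which methods have been attached to one of these generics, the existsMethod and showMethods commands can be used. A quick sketch is given below; the exact formatting of the showMethods output may differ slightly:

    > existsMethod("setVelocity","Agent")
    [1] TRUE
    > showMethods("setVelocity")
    Function: setVelocity (package .GlobalEnv)
    theObject="Agent"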
    

    The last topic examined is the idea of overloading functions. In the examples above the signature is set to a single element. The signature is a vector of characters and specifies the data types of the argument list for the method to be defined. Here we create two new methods. The name of the method is resetActivity, and there are two versions.

    The first version accepts two arguments whose types are Agent and logical. This version of the method sets the active slot to the given value and resets the velocity to zero. The second version accepts two arguments whose types are Agent and numeric. This version sets the active slot to TRUE and then sets the velocity to the value passed to it. Note that the names of the variables in the argument list must be exactly the same in both versions.

    # create a method to reset the velocity and the activity
    setGeneric(name="resetActivity",
                           def=function(theObject,value)
                           {
                                   standardGeneric("resetActivity")
                           }
                           )
    
    setMethod(f="resetActivity",
                          signature=c("Agent","logical"),
                          definition=function(theObject,value)
                          {
                                  theObject <- setActive(theObject,value)
                                  theObject <- setVelocity(theObject,c(0.0,0.0))
                                  return(theObject)
                          }
                          )
    
    setMethod(f="resetActivity",
                          signature=c("Agent","numeric"),
                          definition=function(theObject,value)
                          {
                                  theObject <- setActive(theObject,TRUE)
                                  theObject <- setVelocity(theObject,value)
                                  return(theObject)
                          }
                          )
    

    This definition yields two options for the resetActivity function. Which version is called depends on the types of the two arguments. For example, if the first argument is from the Agent class and the second is a value of TRUE or FALSE, then the first version of the function is called. Otherwise, if the second argument is a number the second version of the function is called.

    > a <- Agent()
    > a
    An object of class "Agent"
    Slot "location":
    [1] 0 0
    
    Slot "velocity":
    [1] 0 0
    
    Slot "active":
    [1] TRUE
    
    > a <- resetActivity(a,FALSE)
    > getActive(a)
    [1] FALSE
    >  a <- resetActivity(a,c(1,3))
    > getVelocity(a)
    [1] 1 3
    

    2.4. Inheritance

    A class’ inheritance hierarchy can be specified when the class is defined using the contains option. The contains option is a vector that lists the classes the new class inherits from. In the following example we build on the Agent class defined in the previous section. The idea is that we need agents to represent predators and prey. We will focus on two predator classes and one prey class for this example.

    The hierarchy for the classes is shown in Figure 2. In this example we have one Prey class that is derived from the Agent class. There are two predator classes, Bobcat and Lynx. The Bobcat class is derived from the Agent class, and the Lynx class is derived from the Bobcat class. We will keep this very simple, and the only method associated with the new classes is a move method. For our purposes it will only print out a message and set the values of the position and velocity to demonstrate the order of execution of the methods associated with the classes.

    Figure 2. Diagram of the predator and prey classes derived from the Agent class.

    The first step is to create the three new classes.

    ######################################################################
    # Create the Prey class
    #
    # This is used to represent a prey animal
    Prey <- setClass(
            # Set the name for the class
            "Prey",
    
            # Define the slots - in this case it is empty...
            slots = character(0),
    
            # Set the default values for the slots. (optional)
            prototype=list(),
    
            # Make a function that can test to see if the data is consistent.
            # This is not called if you have an initialize function defined!
            validity=function(object)
            {
                    if(sum(object@velocity^2)>70.0) {
                            return("The velocity level is out of bounds.")
                    }
                    return(TRUE)
            },
    
            # Set the inheritance for this class
            contains = "Agent"
            )
    
    
    
    ######################################################################
    # Create the Bobcat class
    #
    # This is used to represent a smaller predator
    Bobcat <- setClass(
            # Set the name for the class
            "Bobcat",
    
            # Define the slots - in this case it is empty...
            slots = character(0),
    
            # Set the default values for the slots. (optional)
            prototype=list(),
    
            # Make a function that can test to see if the data is consistent.
            # This is not called if you have an initialize function defined!
            validity=function(object)
            {
                    if(sum(object@velocity^2)>85.0) {
                            return("The velocity level is out of bounds.")
                    }
                    return(TRUE)
            },
    
            # Set the inheritance for this class
            contains = "Agent"
            )
    
    ######################################################################
    # Create the Lynx class
    #
    # This is used to represent a larger predator
    Lynx <- setClass(
            # Set the name for the class
            "Lynx",
    
            # Define the slots - in this case it is empty...
            slots = character(0),
    
            # Set the default values for the slots. (optional)
            prototype=list(),
    
            # Make a function that can test to see if the data is consistent.
            # This is not called if you have an initialize function defined!
            validity=function(object)
            {
                    if(sum(object@velocity^2)>95.0) {
                            return("The velocity level is out of bounds.")
                    }
                    return(TRUE)
            },
    
            # Set the inheritance for this class
            contains = "Bobcat"
            )
    

    The inheritance is specified using the contains option in the setClass command. Note that this can be a vector allowing for multiple inheritance. We choose not to use that to keep things simpler. If you are feeling like you need more self-loathing in your life you should try it out and experiment.
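
    As a minimal sketch of what that would look like, the class below inherits from both the Bobcat and Prey classes. The class name is made up for illustration and is not used elsewhere in this tutorial:

    # A hypothetical class using multiple inheritance (for illustration only)
    ConfusedCat <- setClass(
            "ConfusedCat",
            contains = c("Bobcat","Prey")
            )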

    Next we define a method, move, for the new classes. We will include methods for the Agent, Prey, Bobcat, and Lynx classes. The methods do not really do anything but are used to demonstrate the idea of how methods are executed.

    # create a method to move the agent.
    setGeneric(name="move",
                           def=function(theObject)
                           {
                                   standardGeneric("move")
                           }
                           )
    
    setMethod(f="move",
                          signature="Agent",
                          definition=function(theObject)
                          {
                                  print("Move this Agent dude")
                                  theObject <- setVelocity(theObject,c(1,2))
                                  validObject(theObject)
                                  return(theObject)
                          }
                          )
    
    setMethod(f="move",
                          signature="Prey",
                          definition=function(theObject)
                          {
                                  print("Check this Prey before moving this dude")
                                  theObject <- callNextMethod(theObject)
                                  print("Move this Prey dude")
                                  validObject(theObject)
                                  return(theObject)
                          }
                          )
    
    setMethod(f="move",
                          signature="Bobcat",
                          definition=function(theObject)
                          {
                                  print("Check this Bobcat before moving this dude")
                                  theObject <- setLocation(theObject,c(2,3))
                                  theObject <- callNextMethod(theObject)
                                  print("Move this Bobcat dude")
                                  validObject(theObject)
                                  return(theObject)
                          }
                          )
    
    setMethod(f="move",
                          signature="Lynx",
                          definition=function(theObject)
                          {
                                  print("Check this Lynx before moving this dude")
                                  theObject <- setActive(theObject,FALSE)
                                  theObject <- callNextMethod(theObject)
                                  print("Move this Lynx dude")
                                  validObject(theObject)
                                  return(theObject)
                          }
                          )
    

    There are a number of things to note. First, each method calls the callNextMethod command. This command executes the next version of the same method, the one associated with the previous class in the hierarchy. Note that the arguments are passed along in the same order as they were given to the original function. Also note that callNextMethod returns a copy of the object, and that copy is used to update the object within the calling method.

    Another thing to note is that the methods associated with the Lynx, Bobcat, and Agent classes arbitrarily change the values of the position, velocity, and activity for the given object. This is done to demonstrate the changes that take place and to reinforce the necessity of using the callNextMethod function the way it is used here.

    Finally, it should be noted that the validObject command is called in every method. You should try adding a print statement in the validity function. You might find that the order is a bit odd. You should experiment with this and play with it. There are times you do not get the expected results so be careful!
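
    One way to experiment is to attach a validity function that announces itself. The sketch below uses the setValidity command to replace the validity function for the Agent class with one that prints a message before doing the same check as before. Note that the transcripts shown below were generated without this change:

    # Replace the validity check for Agent with a version that prints a message
    setValidity("Agent", function(object)
            {
                    print("Checking whether this Agent is valid")
                    if(sum(object@velocity^2)>100.0) {
                            return("The velocity level is out of bounds.")
                    }
                    return(TRUE)
            })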

    We now give a brief example to demonstrate the order that the functions are called. In the example we create a Bobcat object and then call the move method. We next create a Lynx object and do the same. We print out the slots for both agents just to demonstrate the values that are changed.

    > robert <- Bobcat()
    > robert
    An object of class "Bobcat"
    Slot "location":
    [1] 0 0
    
    Slot "velocity":
    [1] 0 0
    
    Slot "active":
    [1] TRUE
    
    > robert <- move(robert)
    [1] "Check this Bobcat before moving this dude"
    [1] "Move this Agent dude"
    [1] "Move this Bobcat dude"
    > robert
    An object of class "Bobcat"
    Slot "location":
    [1] 2 3
    
    Slot "velocity":
    [1] 1 2
    
    Slot "active":
    [1] TRUE
    
    >
    >
    >
    > lionel <- Lynx()
    > lionel
    An object of class "Lynx"
    Slot "location":
    [1] 0 0
    
    Slot "velocity":
    [1] 0 0
    
    Slot "active":
    [1] TRUE
    
    > lionel <- move(lionel)
    [1] "Check this Lynx before moving this dude"
    [1] "Check this Bobcat before moving this dude"
    [1] "Move this Agent dude"
    [1] "Move this Bobcat dude"
    [1] "Move this Lynx dude"
    > lionel
    An object of class "Lynx"
    Slot "location":
    [1] 2 3
    
    Slot "velocity":
    [1] 1 2
    
    Slot "active":
    [1] FALSE
    
    

    16. Case Study: Working Through a HW Problem

    We look at a sample homework problem and the R commands necessary to explore the problem. It is assumed that you are familiar with all of the commands discussed throughout this tutorial.

    16.1. Problem Statement

    This problem comes from the 5th edition of Moore and McCabe’s Introduction to the Practice of Statistics and can be found on pp. 466-467. The data consists of the emissions of three different pollutants from 46 different engines. A copy of the data we use here is available. The problem examined here is different from that given in the book but is motivated by the discussion in the book.

    In the following examples we will look at the carbon monoxide data which is one of the columns of this data set. First we will transform the data so that it is close to being normally distributed. We will then find the confidence interval for the mean and then perform a significance test to evaluate whether or not the data is away from a fixed standard. Finally, we will find the power of the test to detect a fixed difference from that standard. We will assume that a confidence level of 95% is used throughout.

    16.2. Transforming the Data

    We begin with a basic examination of the data. A copy of the data file can be found at table_7_3.csv. The first step is to read in the file and get a summary of the center and spread of the data. In this instance we will focus only on the carbon monoxide data.

    > engine <- read.csv(file="table_7_3.csv",sep=",",head=TRUE)
    > names(engine)
    [1] "en"  "hc"  "co"  "nox"
    > summary(engine)
           en              hc               co              nox
     Min.   : 1.00   Min.   :0.3400   Min.   : 1.850   Min.   :0.490
     1st Qu.:12.75   1st Qu.:0.4375   1st Qu.: 4.388   1st Qu.:1.110
     Median :24.50   Median :0.5100   Median : 5.905   Median :1.315
     Mean   :24.00   Mean   :0.5502   Mean   : 7.879   Mean   :1.340
     3rd Qu.:35.25   3rd Qu.:0.6025   3rd Qu.:10.015   3rd Qu.:1.495
     Max.   :46.00   Max.   :1.1000   Max.   :23.530   Max.   :2.940
    

    At first glance the carbon monoxide data appears to be skewed. The spread between the third quartile and the max is five times the spread between the min and the first quartile. A boxplot, shown in Figure 1, suggests that the data is skewed. This is further confirmed in the histogram shown in Figure 2. Finally, a normal qq plot is given in Figure 3. The data does not appear to be normal.

    > boxplot(engine$co,main="Carbon Monoxide")
    > hist(engine$co,main="Carbon Monoxide")
    > qqnorm(engine$co,main="Carbon Monoxide")
    > qqline(engine$co)
    >
    
    Figure 1. Boxplot of the Carbon Monoxide Data.

    Figure 2. Histogram of the Carbon Monoxide Data.

    Figure 3. Normal QQ Plot of the Carbon Monoxide Data.

    We next see if the data can be transformed to something that is closer to being normally distributed. We examine the logarithm of the data. First, the boxplot of the log of the data appears to be more evenly distributed, as shown in Figure 4. Also, the histogram appears to be centered and closer to normal in Figure 5. Finally, the normal qq plot is shown in Figure 6. It shows that the data is more consistent with what we would expect from normal data.

    > lengine <- log(engine$co)
    > boxplot(lengine,main="Carbon Monoxide")
    > hist(lengine,main="Carbon Monoxide")
    > qqnorm(lengine,main="QQ Plot for the Log of the Carbon Monoxide")
    > qqline(lengine)
    >
    
    Figure 4. Boxplot of the Logarithm of the Carbon Monoxide Data.

    Figure 5. Histogram of the Logarithm of the Carbon Monoxide Data.

    Figure 6. Normal QQ Plot of the Logarithm of the Carbon Monoxide Data.

    There is strong evidence that the logarithm of the carbon monoxide data more closely resembles a normal distribution than does the raw carbon monoxide data. For that reason all of the analysis that follows will be for the logarithm of the data and will make use of the new list “lengine.”

    16.3. The Confidence Interval

    We now find the confidence interval for the carbon monoxide data. As stated above, we will work with the logarithm of the data because it appears to be closer to a normal distribution. This data is stored in the list called “lengine.” Since we do not know the true standard deviation we will use the sample standard deviation and will use a t-distribution.

    We first find the sample mean, the sample standard deviation, and the number of observations:

    > m <- mean(lengine)
    > s <- sd(lengine)
    > n <- length(lengine)
    > m
    [1] 1.883678
    > s
    [1] 0.5983851
    > n
    [1] 48
    

    Now we find the standard error:

    > se <- s/sqrt(n)
    > se
    [1] 0.08636945
    

    Finally, the margin of error is found based on a 95% confidence level which can then be used to define the confidence interval:

    > error <- se*qt(0.975,df=n-1)
    > error
    [1] 0.1737529
    > left <- m-error
    > right <- m + error
    > left
    [1] 1.709925
    > right
    [1] 2.057431
    

    The 95% confidence interval is between 1.71 and 2.06. Keep in mind that this is for the logarithm so the 95% confidence interval for the original data can be found by “undoing” the logarithm:

    > exp(left)
    [1] 5.528548
    > exp(right)
    [1] 7.82584
    

    So the 95% confidence interval for the carbon monoxide is between 5.53 and 7.83.
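
    As a quick cross-check, the same interval can be pulled out of the t.test command (introduced in the next section) and then exponentiated. A sketch is given below:

    > exp(as.numeric(t.test(lengine)$conf.int))
    [1] 5.528548 7.825840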

    16.4. Test of Significance

    We now perform a test of significance. Suppose that ideally the engines should have a mean level of 5.4; we do a two-sided hypothesis test. We label the true mean \(\mu_x\) and state the hypothesis test:

    \[H_0: \mu_x = 5.4, \qquad H_a: \mu_x \neq 5.4.\]

    To perform the hypothesis test we first assume that the null hypothesis is true and find the confidence interval around the assumed mean. Fortunately, we can use the values from the previous step:

    > lNull <- log(5.4) - error
    > rNull <- log(5.4) + error
    > lNull
    [1] 1.512646
    > rNull
    [1] 1.860152
    > m
    [1] 1.883678
    

    The sample mean lies outside of the assumed confidence interval so we can reject the null hypothesis. There is a low probability that we would have obtained our sample mean if the true mean really were 5.4.

    Another way to approach the problem is to calculate the actual p-value for the sample mean that was found. Since the sample mean is greater than the assumed mean it can be found with the following code:

    > 2*(1-pt((m-log(5.4))/se,df=n-1))
    [1] 0.02692539
    

    Since the p-value is 2.7%, which is less than 5%, we can reject the null hypothesis.

    Note that there is yet another way to do this. The function t.test will do a lot of this work for us.

    > t.test(lengine,mu = log(5.4),alternative = "two.sided")
    
            One Sample t-test
    
    data:  lengine
    t = 2.2841, df = 47, p-value = 0.02693
    alternative hypothesis: true mean is not equal to 1.686399
    95 percent confidence interval:
     1.709925 2.057431
    sample estimates:
    mean of x
     1.883678
    

    More information and a more complete list of the options for this command can be found using the help command:

    > help(t.test)
    

    16.5. The Power of the Test

    We now find the power of the test. To find the power we need to set a level for the mean and then find the probability that we would accept the null hypothesis if the mean is really at the prescribed level. Here we will find the power to detect a difference if the level were 7. Three different methods are examined. The first is a method that some books advise to use if you do not have a non-central t-test available. The second does make use of the non-central t-test. Finally, the third method makes use of a customized R command.

    We first find the probability of accepting the null hypothesis if the level really were 7. We assume that the true mean is 7 and then find the probability that a sample mean would fall within the confidence interval if the null hypothesis were true. Keep in mind that we have to transform the level of 7 by taking its logarithm. Also keep in mind that this is a two-sided test:

    > tLeft <- (lNull-log(7))/(s/sqrt(n))
    > tRight <- (rNull-log(7))/(s/sqrt(n))
    > p <- pt(tRight,df=n-1) - pt(tLeft,df=n-1)
    > p
    [1] 0.1629119
    > 1-p
    [1] 0.8370881
    

    So the probability of making a type II error is approximately 16.3%, and the probability of detecting a difference if the level really is 7 is approximately 83.7%.

    Another way to find the power is to use a non-centrality parameter. This is the method that many books advise over the previous method. The idea is that you give it the critical t-values associated with your test and also provide a parameter that indicates how the mean is shifted.

    > t <- qt(0.975,df=n-1)
    > shift <- (log(5.4)-log(7))/(s/sqrt(n))
    > pt(t,df=n-1,ncp=shift)-pt(-t,df=n-1,ncp=shift)
    [1] 0.1628579
    > 1-(pt(t,df=n-1,ncp=shift)-pt(-t,df=n-1,ncp=shift))
    [1] 0.8371421
    

    Again, we see that the power of the test is approximately 83.7%. Note that this result is slightly different from the previous answer. This approach is often recommended over the previous approach.

    The final approach we examine allows us to do all the calculations in one step. It makes use of the non-centrality parameter as in the previous example, but all of the commands are done for us.

    > power.t.test(n=n,delta=log(7)-log(5.4),sd=s,sig.level=0.05,
                  type="one.sample",alternative="two.sided",strict = TRUE)
    
         One-sample t test power calculation
    
                  n = 48
              delta = 0.2595112
                 sd = 0.5983851
          sig.level = 0.05
              power = 0.8371421
        alternative = two.sided
    

    This is a powerful command that can do much more than just calculate the power of a test. For example it can also be used to calculate the number of observations necessary to achieve a given power. For more information check out the help page, help(power.t.test).
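
    For instance, to ask how many observations would be needed to detect the same difference with 90% power, leave out the n argument and supply power instead. A sketch is given below; the printed summary then includes the required value of n:

    > power.t.test(delta=log(7)-log(5.4),sd=s,sig.level=0.05,power=0.90,
                  type="one.sample",alternative="two.sided",strict=TRUE)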

    17. Case Study II: A JAMA Paper on Cholesterol

    We look at a paper that appeared in the Journal of the American Medical Association and explore how to use R to confirm the results. It is assumed that you are familiar with all of the commands discussed throughout this tutorial.

    17.1. Overview of the Paper

    The paper we examine is by Carroll et al. [Carroll2005] The goal is to confirm the results and explore some of the other results not explicitly addressed in the paper. This paper received a great deal of attention in the media, although many of the original news articles are now defunct.

    The authors examine the trends across several studies of the cholesterol levels of Americans. The studies were conducted in 1960-1962, 1971-1974, 1976-1980, 1988-1994, and 1999-2002. Analyses of the studies prior to 1999 indicated that overall cholesterol levels were declining. The authors of this paper focus on the changes between the two latest studies, 1988-1994 and 1999-2002. They concluded that in certain populations cholesterol levels decreased over this time.

    One of the things that received a great deal of attention is the linkage the authors drew between lowered cholesterol levels and increased use of new drugs to lower cholesterol. Here is a quote from their conclusions:

    The increase in the proportion of adults using lipid-lowering medication, particularly in older age groups, likely contributed to the decreases in total and LDL cholesterol levels observed.

    Here we focus on confirming the results listed in Tables 3 and 4 of the paper. We confirm the p-values given in the paper and then calculate the power of the test to detect a prescribed difference in cholesterol levels.

    17.2. The Tables

    Links to the tables in the paper are given below. Links are given to verbatim copies of the tables. For each table there are two links. The first is to a text file displaying the table. The second is to a csv file to be loaded into R. It is assumed that you have downloaded each of the csv files and made them available.

    Links to the Tables in the paper:

    Table 1 (text, csv)
    Table 2 (text, csv)
    Table 3 (text, csv)
    Table 4 (text, csv)
    Table 5 (text, csv)
    Table 6 (text, csv)

    17.3. Confirming the p-values in Table 3

    The first thing we do is confirm the p-values. The paper does not explicitly state the hypothesis test, but they use a two-sided test as we shall soon see. We will explicitly define the hypothesis test that the authors are using but first need to define some terms. We need the means for the 1988-1994 and the 1999-2002 studies and will denote them M88 and M99 respectively. We also need the standard errors and will denote them SE88 and SE99 respectively.

    In this situation we are trying to compare the means of two experiments and do not have matched pairs. With this in mind we can define our hypothesis test:

    \[H_0: M_{88} - M_{99} = 0, \qquad H_a: M_{88} - M_{99} \neq 0.\]

    When we assume that the null hypothesis is true we calculate the p-values using the following values:

    \[\mbox{Sample Mean} = M_{88} - M_{99}, \qquad SE = \sqrt{SE_{88}^2 + SE_{99}^2}.\]

    Note that the standard errors are given in the data, so we do not have to use the number of observations to calculate the standard error. However, we do need the number of observations to calculate the p-value. The authors used a t test. There are complicated formulas used to calculate the degrees of freedom for the comparison of two means, but here we will simply take the minimum of the two sample sizes and subtract one.
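
    For reference, one common approximation that we do not use here is the Welch-Satterthwaite formula, which can be written in terms of the standard errors as

    \[df \approx \frac{\left(SE_{88}^2 + SE_{99}^2\right)^2}{SE_{88}^4/(N_{88}-1) + SE_{99}^4/(N_{99}-1)}.\]

    Taking the minimum of the two sample sizes and subtracting one gives a smaller number of degrees of freedom, which makes the test slightly more conservative.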

    We first need to read in the data from table3.csv and will call the new variable t3. Note that we use a new option, row.names="group". This option tells R to use the entries in the “group” column as the row names. Once the table has been read we will need to make use of the means in the 1988-1994 study (t3$M.88) and the sample means in the 1999-2002 study (t3$M.99). We will also have to make use of the corresponding standard errors (t3$SE.88 and t3$SE.99) and the number of observations (t3$N.88 and t3$N.99).

    > t3 <- read.csv(file="table3.csv",header=TRUE,sep=",",row.names="group")
    > row.names(t3)
     [1] "all"    "g20"    "men"    "mg20"   "m20-29" "m30-39" "m40-49" "m50-59"
     [9] "m60-74" "m75"    "women"  "wg20"   "w20-29" "w30-39" "w40-49" "w50-59"
    [17] "w60-74" "w75"
    > names(t3)
     [1] "N.60"  "M.60"  "SE.60" "N.71"  "M.71"  "SE.71" "N.76"  "M.76"  "SE.76"
    [10] "N.88"  "M.88"  "SE.88" "N.99"  "M.99"  "SE.99" "p"
    > t3$M.88
     [1] 204 206 204 204 180 201 211 216 214 205 205 207 183 189 204 228 235 231
    > t3$M.99
     [1] 203 203 203 202 183 200 212 215 204 195 202 204 183 194 203 216 223 217
    > diff <- t3$M.88-t3$M.99
    > diff
     [1]  1  3  1  2 -3  1 -1  1 10 10  3  3  0 -5  1 12 12 14
    > se <- sqrt(t3$SE.88^2+t3$SE.99^2)
    > se
     [1] 1.140175 1.063015 1.500000 1.500000 2.195450 2.193171 3.361547 3.041381
     [9] 2.193171 3.328663 1.131371 1.063015 2.140093 1.984943 2.126029 2.483948
    [17] 2.126029 2.860070
    > deg <- pmin(t3$N.88,t3$N.99)-1
    > deg
     [1] 7739 8808 3648 4164  673  672  759  570  970  515 4090 4643  960  860  753
    [16]  568  945  552
    

    We can now calculate the t statistic. From the null hypothesis, the assumed mean of the difference is zero. We can then use the pt command to get the p-values.

    > t <- diff/se
    > t
     [1]  0.8770580  2.8221626  0.6666667  1.3333333 -1.3664626  0.4559608
     [7] -0.2974821  0.3287980  4.5596075  3.0042088  2.6516504  2.8221626
    [13]  0.0000000 -2.5189636  0.4703604  4.8310181  5.6443252  4.8949852
    > pt(t,df=deg)
     [1] 0.809758825 0.997609607 0.747486382 0.908752313 0.086125089 0.675717245
     [7] 0.383089952 0.628785421 0.999997110 0.998603837 0.995979577 0.997604809
    [13] 0.500000000 0.005975203 0.680883135 0.999999125 0.999999989 0.999999354
    

    There are two problems with the calculation above. First, some of the t-values are positive, and for positive values we need the area under the curve to the right. There are a couple of ways to fix this, and here we will ensure that the t scores are negative by taking the negative of the absolute value. The second problem is that this is a two-sided test, and we have to multiply the probability by two:

    > pt(-abs(t),df=deg)
     [1] 1.902412e-01 2.390393e-03 2.525136e-01 9.124769e-02 8.612509e-02
     [6] 3.242828e-01 3.830900e-01 3.712146e-01 2.889894e-06 1.396163e-03
    [11] 4.020423e-03 2.395191e-03 5.000000e-01 5.975203e-03 3.191169e-01
    [16] 8.748656e-07 1.095966e-08 6.462814e-07
    > 2*pt(-abs(t),df=deg)
     [1] 3.804823e-01 4.780786e-03 5.050272e-01 1.824954e-01 1.722502e-01
     [6] 6.485655e-01 7.661799e-01 7.424292e-01 5.779788e-06 2.792326e-03
    [11] 8.040845e-03 4.790382e-03 1.000000e+00 1.195041e-02 6.382337e-01
    [16] 1.749731e-06 2.191933e-08 1.292563e-06
    

    These numbers are a close match to the values given in the paper, but the output above is hard to read. We introduce a new command to loop through and print out the results in a format that is easier to read. The for loop allows you to repeat a command a specified number of times. Here we want to go from 1, 2, 3, ..., to the end of the list of p-values and print out the group and associated p-value:

    > p <- 2*pt(-abs(t),df=deg)
    > for (j in 1:length(p)) {
         cat("p-value for ",row.names(t3)[j]," ",p[j],"\n");
    }
    p-value for  all   0.3804823
    p-value for  g20   0.004780786
    p-value for  men   0.5050272
    p-value for  mg20   0.1824954
    p-value for  m20-29   0.1722502
    p-value for  m30-39   0.6485655
    p-value for  m40-49   0.7661799
    p-value for  m50-59   0.7424292
    p-value for  m60-74   5.779788e-06
    p-value for  m75   0.002792326
    p-value for  women   0.008040845
    p-value for  wg20   0.004790382
    p-value for  w20-29   1
    p-value for  w30-39   0.01195041
    p-value for  w40-49   0.6382337
    p-value for  w50-59   1.749731e-06
    p-value for  w60-74   2.191933e-08
    p-value for  w75   1.292563e-06
    

    We can now compare this to Table 3 and see that we have good agreement. The differences come from round-off errors due to using the truncated data in the article as well as from using a different method to calculate the degrees of freedom. Note that for p-values close to zero the percent errors are very large.

    It is interesting to note that among the categories (rows) given in the table, only a small number of the differences have a p-value small enough to reject the null hypothesis at the 95% level. The differences with a p-value less than 5% are all people older than 20, men 60 to 74, men older than 74, all women, women older than 20, and women in the age groups 30-39, 50-59, 60-74, and older than 74.

    The p-values for nine out of the eighteen categories are low enough to allow us to reject the associated null hypothesis. One of those categories covers all people over 20, but very few of the male categories have significant differences at the 95% level. The majority of the significant differences are in the female categories, especially the older age brackets.
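    As a quick check, and assuming the vectors p and t3 from the commands above are still available, we can count the significant differences and list the corresponding groups; the first command should report nine categories, matching the list above:

    > sum(p < 0.05)
    > row.names(t3)[p < 0.05]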

    17.4. Confirming the p-values in Table 4

    We now confirm the p-values given in Table 4. We do not repeat the level of detail of the previous section; rather, the commands are briefly given below:

    > t4 <- read.csv(file="table4.csv",header=TRUE,sep=",",row.names="group")
    > names(t4)
    [1] "S88N"  "S88M"  "S88SE" "S99N"  "S99M"  "S99SE" "p"
    > diff <- t4$S88M - t4$S99M
    > se <- sqrt(t4$S88SE^2+t4$S99SE^2)
    > deg <- pmin(t4$S88N,t4$S99N)-1
    > t <- diff/se
    > p <- 2*pt(-abs(t),df=deg)
    > for (j in 1:length(p)) {
         cat("p-values for ",row.names(t4)[j]," ",p[j],"\n");
    }
    p-values for  MA   0.07724362
    p-values for  MAM   0.6592499
    p-values for  MAW   0.002497728
    p-values for  NHW   0.1184228
    p-values for  NHWM   0.2673851
    p-values for  NHWW   0.02585374
    p-values for  NHB   0.001963195
    p-values for  NHBM   0.003442551
    p-values for  NHBW   0.007932079
    

    Again, the p-values are close to those given in Table 4. The numbers are slightly off due to truncation of the data reported in the article as well as the simplified calculation of the degrees of freedom. As in the previous section, the p-values that are close to zero have the greatest percent errors.
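    One way to see the agreement side by side is to bind our computed values to the p column read from table4.csv, which appears to hold the p-values reported in the paper (see the output of names(t4) above). A minimal sketch:

    > cbind(reported=t4$p,computed=signif(p,3))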

    17.5. Finding the Power of the Test in Table 3

    We will now find the power of the test to detect a difference. Here we arbitrarily choose to find the power to detect a difference of 4 points; a sketch for a difference of 6 points is given at the end of this section. The first step is to assume that the null hypothesis is true and find the cutoffs for a 95% confidence interval around a difference of zero:

    > t3 <- read.csv(file="table3.csv",header=TRUE,sep=",",row.names="group")
    > se <- sqrt(t3$SE.88^2+t3$SE.99^2)
    > deg <- pmin(t3$N.88,t3$N.99)-1
    > tcut <- qt(0.975,df=deg)
    > tcut
     [1] 1.960271 1.960233 1.960614 1.960534 1.963495 1.963500 1.963094 1.964135
     [9] 1.962413 1.964581 1.960544 1.960475 1.962438 1.962726 1.963119 1.964149
    [17] 1.962477 1.964271
    

    Now that the cutoff t-scores for the 95% confidence interval have been established, we want to find the probability of making a type II error, assuming that the true difference is a positive 4.

    > typeII <- pt(tcut,df=deg,ncp=4/se)-pt(-tcut,df=deg,ncp=4/se)
    > typeII
     [1] 0.06083127 0.03573266 0.24009202 0.24006497 0.55583392 0.55508927
     [7] 0.77898598 0.74064782 0.55477307 0.77573784 0.05765826 0.03576160
    [13] 0.53688674 0.47884787 0.53218625 0.63753092 0.53199248 0.71316969
    > 1-typeII
     [1] 0.9391687 0.9642673 0.7599080 0.7599350 0.4441661 0.4449107 0.2210140
     [8] 0.2593522 0.4452269 0.2242622 0.9423417 0.9642384 0.4631133 0.5211521
    [15] 0.4678138 0.3624691 0.4680075 0.2868303
    

    It looks like there is a mix here. Some of the tests have a very high power while others are poor. Six of the categories have very high power, and four have power less than 30%. One problem is that the output is hard to read, so we again use a for loop to create a nicer output:

    > power <- 1-typeII
    > for (j in 1:length(power)) {
         cat("power for ",row.names(t3)[j]," ",power[j],"\n");
     }
    
    power for  all   0.9391687
    power for  g20   0.9642673
    power for  men   0.759908
    power for  mg20   0.759935
    power for  m20-29   0.4441661
    power for  m30-39   0.4449107
    power for  m40-49   0.221014
    power for  m50-59   0.2593522
    power for  m60-74   0.4452269
    power for  m75   0.2242622
    power for  women   0.9423417
    power for  wg20   0.9642384
    power for  w20-29   0.4631133
    power for  w30-39   0.5211521
    power for  w40-49   0.4678138
    power for  w50-59   0.3624691
    power for  w60-74   0.4680075
    power for  w75   0.2868303
    
    

    We see that the composite groups, those made up of several age brackets, have much higher power than the age-stratified groups. The groups composed of women also appear to have somewhat higher power.
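    At the start of this section we said that we would also look at a difference of 6 points. The calculation is identical except for the non-centrality parameter; a minimal sketch, reusing tcut, deg, se, and t3 from above:

    > typeII6 <- pt(tcut,df=deg,ncp=6/se)-pt(-tcut,df=deg,ncp=6/se)
    > power6 <- 1-typeII6
    > for (j in 1:length(power6)) {
         cat("power for ",row.names(t3)[j]," ",power6[j],"\n");
     }

    Since the non-centrality parameter is larger, every group's power to detect a 6 point difference is at least as high as its power to detect a 4 point difference.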

    17.6. Differences by Race in Table 2

    We now look at the differences in the rows of LDL cholesterol in Table 2. Table 2 lists the cholesterol levels broken down by race. Here the p-values are calculated for the means for the totals rather than for every entry in the table. We will use the same two-sided hypothesis test as above; the null hypothesis is that there is no difference between the means.

    Since we are comparing means from a subset of the entries and across rows we cannot simply copy the commands from above. We will first read the data from the files and then separate the first, fourth, seventh, and tenth rows:

    > t1 <- read.csv(file='table1.csv',header=TRUE,sep=',',row.names="Group")
    > t1
           Total  HDL  LDL  STG
    AllG20  8809 8808 3867 3982
    MG20    4165 4164 1815 1893
    WG20    4644 4644 2052 2089
    AMA     2122 2122  950  994
    AMA-M    998  998  439  467
    AMA-F   1124 1124  511  527
    ANHW    4338 4337 1938 1997
    ANHW-M  2091 2090  924  965
    ANHW-F  2247 2247 1014 1032
    ANHB    1602 1602  670  674
    ANHB-M   749  749  309  312
    ANHB-W   853  853  361  362
    M20-29   674  674  304  311
    M30-39   673  673  316  323
    M40-49   760  760  318  342
    M50-59   571  571  245  262
    M60-69   671  670  287  301
    M70      816  816  345  354
    W20-29   961  961  415  419
    W30-39   861  861  374  377
    W40-49   754  755  347  352
    W50-59   569  569  256  263
    W60-69   672  671  315  324
    W70      827  827  345  354
    > t2 <- read.csv(file='table2.csv',header=TRUE,sep=',',row.names="group")
    > ldlM  <- c(t2$LDL[1],t2$LDL[4],t2$LDL[7],t2$LDL[10])
    > ldlSE <- c(t2$LDLSE[1],t2$LDLSE[4],t2$LDLSE[7],t2$LDLSE[10])
    > ldlN  <- c(t1$LDL[1],t1$LDL[4],t1$LDL[7],t1$LDL[10])
    > ldlNames  <- c(row.names(t1)[1],row.names(t1)[4],row.names(t1)[7],row.names(t1)[10])
    > ldlM
    [1] 123 121 124 121
    > ldlSE
    [1] 1.0 1.3 1.2 1.6
    > ldlN
    [1] 3867  950 1938  670
    > ldlNames
    [1] "AllG20" "AMA"    "ANHW"   "ANHB"
    

    We can now find the approximate p-values. This is not the same as the previous examples because the means are not being compared across matching values of different lists but down the rows of a single list. We will make use of two for loops. The idea is that we will loop through each row except the last row. Then for each of these rows we make a comparison with every row beneath it:

    > for (j in 1:(length(ldlM)-1)) {
        for (k in (j+1):length(ldlM)) {
           diff <- ldlM[j]-ldlM[k]
           se <- sqrt(ldlSE[j]^2+ldlSE[k]^2)
           t <- diff/se
           n <- min(ldlN[j],ldlN[k])-1
           p <- 2*pt(-abs(t),df=n)
           cat("(",j,",",k,") - ",ldlNames[j]," vs ",ldlNames[k]," p: ",p,"\n")
           }
      }
    ( 1 , 2 ) -  AllG20  vs  AMA  p:  0.2229872
    ( 1 , 3 ) -  AllG20  vs  ANHW  p:  0.5221284
    ( 1 , 4 ) -  AllG20  vs  ANHB  p:  0.2895281
    ( 2 , 3 ) -  AMA  vs  ANHW  p:  0.09027066
    ( 2 , 4 ) -  AMA  vs  ANHB  p:  1
    ( 3 , 4 ) -  ANHW  vs  ANHB  p:  0.1340862
    

    We cannot reject the null hypothesis at the 95% level for any of the differences. We cannot say that there is a significant difference between any of the groups in this comparison.
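    Since six pairwise comparisons are made here, one might also ask whether adjusting for multiple comparisons changes the conclusion; it cannot, because the adjustment only increases the p-values. As a rough sketch, collecting the six p-values printed above into a vector (an extra step not part of the loop above), R's p.adjust function applies a Bonferroni correction:

    > pvals <- c(0.2229872,0.5221284,0.2895281,0.09027066,1,0.1340862)
    > p.adjust(pvals,method="bonferroni")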

    17.7. Summary

    We examined some of the results stated in the paper Trends in Serum Lipids and Lipoproteins of Adults, 1960-2002 [Carroll2005] and confirmed the stated p-values for Tables 3 and 4. We also found the p-values for differences by some of the race categories in Table 2.

    When checking the p-values for Table 3 we found that nine of the eighteen differences are significant. Looking at the boxplot of the means from Table 3 (below), we see that two of those means, the means for the two oldest categories of women, appear as outliers.

    > boxplot(t3$M.60,t3$M.71,t3$M.76,t3$M.88,t3$M.99,
              names=c("60-62","71-73","76-80","88-94","99-02"),
              xlab="Studies",ylab="Serum Total Cholesterol",
              main="Comparison of Serum Total Cholesterol By Study")
    
    Figure 1. Boxplot of the results in Table 3, comparing serum total cholesterol by study.

    The boxplot helps demonstrate what was found in the comparisons of the previous studies. There have been long-term declines in cholesterol levels over the whole span of these studies. The new wrinkle in this study is the association with the use of statins, which reduce blood serum cholesterol levels. The results given in Table 6 show that every category of people has shown a significant increase in the use of statins.

    One question for discussion is whether or not the association demonstrates strong enough evidence to claim a causative link between the two. The news articles that were published in response to the article imply causation between the use of statins and lower cholesterol levels. However, the study only demonstrates a positive association, and the decline in cholesterol levels may be a long-term phenomenon with interactions between a wide range of factors.

    Bibliography

    [Carroll2005]
        Margaret D. Carroll, MSPH; David A. Lacher, MD; Paul D. Sorlie, PhD; James I. Cleeman, MD; David J. Gordon, MD, PhD; Michael Wolz, MS; Scott M. Grundy, MD, PhD; Clifford L. Johnson, MSPH. "Trends in Serum Lipids and Lipoproteins of Adults, 1960-2002." Journal of the American Medical Association, October 12, 2005, Vol. 294, No. 14, pp. 1773-1781.